Apache Beam

Apache Beam 2.70.0

Tue, 16 Dec 2025 15:00:00 -0500

We are happy to present the new 2.70.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.70.0, check out the detailed release notes.

Highlights

Flink 1.20 support added (#32647).

New Features / Improvements

Python examples added for Milvus search enrichment handler on Beam Website including jupyter notebook example (Python) (#36176).
Milvus sink I/O connector added (Python) (#36702). Now Beam has full support for Milvus integration including Milvus enrichment and sink operations.

Breaking Changes

(Python) Some Python dependencies have been split out into extras. To make sure all previously installed dependencies are still installed, you can run pip install apache-beam[gcp,interactive,yaml,redis,hadoop,tfrecord] when installing Beam, though most users will not need all of these extras (#34554).
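
As a rough illustration of picking only the extras a project needs, the sketch below composes a pip requirement string. The extra names are the ones listed in the note above; the helper name `beam_requirement` is hypothetical and not part of Beam.

```python
# Extras named in the 2.70.0 release note; most pipelines need only a few.
KNOWN_EXTRAS = ("gcp", "interactive", "yaml", "redis", "hadoop", "tfrecord")


def beam_requirement(extras, version=None):
    """Build a pip requirement string such as 'apache-beam[gcp,yaml]==2.70.0'.

    Hypothetical helper for illustration; it only formats a string.
    """
    unknown = [e for e in extras if e not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"Unknown extras: {unknown}")
    # Keep the order from KNOWN_EXTRAS so the output is stable.
    chosen = [e for e in KNOWN_EXTRAS if e in extras]
    spec = "apache-beam"
    if chosen:
        spec += "[" + ",".join(chosen) + "]"
    if version:
        spec += "==" + version
    return spec


if __name__ == "__main__":
    # Example: a user who runs on GCP and also writes Beam YAML pipelines.
    print(beam_requirement(["yaml", "gcp"], version="2.70.0"))
```

The resulting string can be pasted into a requirements file or passed to pip directly.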

Deprecations

(Python) Python 3.9 reached EOL in October 2025, and support for this language version has been removed (#36665).
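
Projects that still run on mixed interpreters may want an early check before launching a pipeline. The following is a minimal sketch, assuming Python 3.10 is the oldest version still supported now that 3.9 has been removed; the function name is illustrative.

```python
import sys

# Assumption: with Python 3.9 support removed, 3.10 is the oldest supported version.
MIN_SUPPORTED = (3, 10)


def is_supported(version_info=None):
    """Return True if the interpreter version is at least MIN_SUPPORTED."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_SUPPORTED


if __name__ == "__main__":
    if is_supported():
        print("Python version is new enough.")
    else:
        print("This Python version is no longer supported; please upgrade.")
```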

List of Contributors

According to git shortlog, the following people contributed to the 2.70.0 release. Thank you to all contributors!

Abdelrahman Ibrahim, Ahmed Abualsaud, Alex Chermenin, Andrew Crites, Arun Pandian, Celeste Zeng, Chamikara Jayalath, Chenzo, Claire McGinty, Danny McCormick, Derrick Williams, Dustin Rhodes, Enrique Calderon, Ian Liao, Jack McCluskey, Jessica Hsiao, Joey Tran, Karthik Talluri, Kenneth Knowles, Maciej Szwaja, Mehdi.D, Mohamed Awnallah, Praneet Nadella, Radek Stankiewicz, Radosław Stankiewicz, Reuven Lax, RuiLong J., S. Veyrié, Sam Whittle, Shunping Huang, Stephan Hoyer, Steven van Rossum, Tanu Sharma, Tarun Annapareddy, Tom Stepp, Valentyn Tymofieiev, Vitaly Terentyev, XQ Hu, Yi Hu, changliiu, claudevdm, fozzie15, kristynsmith, wolfchris-google

Apache Beam 2.69.0

Tue, 28 Oct 2025 15:00:00 -0500

We are happy to present the new 2.69.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.69.0, check out the detailed release notes.

Highlights

(Python) Add YAML Editor and Visualization Panel (#35772).
(Java) Java 25 Support (#35627).

I/Os

Upgraded Iceberg dependency to 1.10.0 (#36123).

New Features / Improvements

Enhance JAXBCoder with XMLInputFactory support (Java) (#36446).
Python examples added for CloudSQL enrichment handler on Beam website (Python) (#35473).
Support for batch mode execution in WriteToPubSub transform added (Python) (#35990).
Added official support for Python 3.13 (#34869).
Added an optional output_schema verification to all YAML transforms (#35952).
Support for encryption when using GroupByKey added, along with --gbek pipeline option to automatically replace all GroupByKey transforms (Java/Python) (#36214).

Breaking Changes

(Python) dill is no longer a required, default dependency for Apache Beam (#21298).
- This change only affects pipelines that explicitly use the pickle_library=dill pipeline option.
- While dill==0.3.1.1 is still pre-installed on the official Beam SDK base images, it is no longer a direct dependency of the apache-beam Python package. This means it can be overridden by other dependencies in your environment.
- If your pipeline uses pickle_library=dill, you must manually ensure dill==0.3.1.1 is installed in both your submission and runtime environments.
  - Submission environment: Install the dill extra in your local environment, e.g. pip install apache-beam[gcp,dill].
  - Runtime (worker) environment: Your action depends on how you manage your worker’s environment.
    - If using default containers or custom containers with the official Beam base image e.g. FROM apache/beam_python3.10_sdk:2.69.0
      - Add dill==0.3.1.1 to your worker’s requirements file (e.g., requirements.txt)
      - Pass this file to your pipeline using the --requirements_file requirements.txt pipeline option (For more details see managing Dataflow dependencies).
    - If using custom containers with a non-Beam base image e.g. FROM python:3.9-slim
      - Install apache-beam with the dill extra in your docker file e.g. RUN pip install --no-cache-dir apache-beam[gcp,dill]
- If there is a dill version mismatch between submission and runtime environments you might encounter unpickling errors like Can't get attribute '_create_code' on <module 'dill._dill' from....
- If dill is not installed in the runtime environment you will see the error ImportError: Pipeline option pickle_library=dill is set, but dill is not installed...
- Report any issues you encounter when using pickle_library=dill to the GitHub issue (#21298)
(Python) Added a pickle_library=dill_unsafe pipeline option. This allows overriding dill==0.3.1.1 using dill as the pickle_library. Use with extreme caution. Other versions of dill have not been tested with Apache Beam (#21298).
(Python) The deterministic fallback coder for complex types like NamedTuple, Enum, and dataclasses now normalizes filepaths for better determinism guarantees. This affects streaming pipelines updating from 2.68 to 2.69 that utilize this fallback coder. If your pipeline is affected, you may see a warning like: “Using fallback deterministic coder for type X…”. To update safely, specify the pipeline option --update_compatibility_version=2.68.0 (#36345).
(Python) Fixed transform naming conflict when executing DataTransform on a dictionary of PColls (#30445). This may break update compatibility if you don’t provide a --transform_name_mapping.
Removed deprecated Hadoop versions (2.10.2 and 3.2.4) that are no longer supported for Iceberg from IcebergIO (#36282).
(Go) Coder construction on SDK side is more faithful to the specs from runners without stripping length-prefix. This may break streaming pipeline update as the underlying coder could be changed (#36387).
Minimum Go version for Beam Go updated to 1.25.2 (#36461).
(Java) DoFn OutputReceiver now requires implementing a builder method as part of extended metadata support for elements (#34902).
(Java) Removed ProcessContext outputWindowedValue introduced in 2.68 that allowed setting offset and record Id. Use OutputReceiver’s builder to set those fields (#36523).
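
For the dill change described above, a quick environment check can catch a missing or mismatched dill before a pipeline is submitted. This is a minimal sketch using only the standard library; the function name `dill_status` and its return strings are illustrative, and the same check would need to run in both the submission and the runtime environment.

```python
from importlib import metadata

# Version that the 2.69.0 release note says must match in both environments.
PINNED_DILL = "0.3.1.1"


def dill_status(get_version=metadata.version):
    """Describe whether the pinned dill version is installed.

    `get_version` is injectable so the check can be exercised without
    changing the real environment.
    """
    try:
        installed = get_version("dill")
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if installed == PINNED_DILL else f"mismatch ({installed})"


if __name__ == "__main__":
    print("dill:", dill_status())
```

A result of "missing" or "mismatch" points to the install steps listed in the dill item above.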

Bugfixes

Fixed passing of pipeline options to x-lang transforms when called from the Java SDK (Java) (#36443).
PulsarIO has now changed support status from incomplete to experimental. Both read and writes should now minimally function (un-partitioned topics, without schema support, timestamp ordered messages for read) (Java) (#36141).
Fixed Spanner Change Stream reading stuck issue due to watermark of partition moving backwards (#36470).

List of Contributors

According to git shortlog, the following people contributed to the 2.69.0 release. Thank you to all contributors!

Abdelrahman Ibrahim, Ahmed Abualsaud, Andrew Crites, Arun Pandian, Bryan Dang, Chamikara Jayalath, Charles Nguyen, Chenzo, Clay Johnson, Danny McCormick, David A, Derrick Williams, Enrique Calderon, Hai Joey Tran, Ian Liao, Ian Mburu, Jack McCluskey, Jiang Zhu, Joey Tran, Kenneth Knowles, Kyle Stanley, Maciej Szwaja, Minbo Bae, Mohamed Awnallah, Radek Stankiewicz, Radosław Stankiewicz, Razvan Culea, Reuven Lax, Sagnik Ghosh, Sam Whittle, Shunping Huang, Steven van Rossum, Talat UYARER, Tanu Sharma, Tarun Annapareddy, Tom Stepp, Valentyn Tymofieiev, Vitaly Terentyev, XQ Hu, Yi Hu, Yilei, claudevdm, flpablo, fozzie15, johnjcasey, lim1t, parveensania, yashu

Google Summer of Code 2025 - Enhanced Interactive Pipeline Development Environment for JupyterLab

Tue, 14 Oct 2025 00:00:00 +0800

GSoC 2025 Basic Information

Student: Canyu Chen (@Chenzo1001)
Mentors: XQ Hu (@liferoad)
Organization: Apache Beam
Proposal Link: Here

Project Overview

BeamVision significantly enhances the Apache Beam development experience within JupyterLab by providing a unified, visual interface for pipeline inspection and analysis. This project successfully delivered a production-ready JupyterLab extension that replaces fragmented workflows with an integrated workspace, featuring a dynamic side panel for pipeline visualization and a multi-tab interface for comparative workflow analysis.

Core Achievements:

Modernized Extension: Upgraded the JupyterLab Sidepanel to v4.x, ensuring compatibility with the latest ecosystem and releasing the package on both NPM and PyPI.

YAML Visualization Suite: Implemented a powerful visual editor for Beam YAML, combining a code editor, an interactive flow chart (built with @xyflow/react-flow), and a collapsible key-value panel for intuitive pipeline design.

Enhanced Accessibility & Stability: Added pip installation support and fixed critical bugs in Interactive Beam, improving stability and user onboarding.

Community Engagement: Active participation in the Beam community, including contributing to a hackathon project and successfully integrating all work into the Apache Beam codebase via merged Pull Requests.

Development Workflow

As early as the beginning of March, I saw Apache’s project information on the official GSoC website and came across Beam among the projects released by Apache. Since I have some interest in front-end development and wanted to truly integrate into the open-source community for development work, I contacted mentor XQ Hu via email and received positive feedback from him. In April, XQ Hu posted notes for all GSoC students on the Beam Mailing List. It was essential to keep an eye on the Mailing List promptly. Between March and May, besides completing the project proposal and preparation work, I also used my spare time to partially migrate the Beam JupyterLab Extension to version 4.0. This helped me get into the development state more quickly.

I also participated in the Beam Hackathon held in May. There were several topics to choose from, and I opted for the free topic. This allowed me to implement any innovative work on Beam. I combined Beam and GCP to create an Automatic Emotion Analysis Tool for comments. This tool integrates Beam Pipeline, Flink, Docker, and GCP to collect and perform sentiment analysis on real-time comment stream data, storing the results in GCP’s BigQuery. This is a highly meaningful task because sentiment analysis of comments can help businesses better understand users’ opinions about their products, thereby improving the products more effectively. However, the time during the Hackathon was too tight, so I haven’t fully completed this project yet, and it can be further improved later. This Hackathon gave me a deeper understanding of Beam and GCP, and also enhanced my knowledge of the development of the Beam JupyterLab Extension.

In June, I officially started the project development and maintained close communication with my mentor to ensure the project progressed smoothly. XQ Hu and I held a half-hour weekly meeting every Monday on Google Meet, primarily to address issues encountered during the previous week’s development and to discuss the tasks for the upcoming week. XQ Hu is an excellent mentor, and I had no communication barriers with him whatsoever. He is also very understanding; sometimes, when I needed to postpone some development tasks due to personal reasons, he was always supportive and gave me ample freedom. During this month, I improved the plugin to make it fully compatible with JupyterLab 4.0.

In July and August, I made some modifications to the plugin’s source code structure and published it on PyPI to facilitate user installation and promote the plugin. During this period, I also fixed several bugs. Afterwards, I began developing a new feature: the YAML visual editor (design doc HERE). This feature is particularly meaningful because Beam’s Pipeline is described through YAML files, and a visual editor for YAML files can significantly improve developers’ efficiency. In July, I published the proposal for the YAML visual editor and, after gathering feedback from the community for some time, started working on its development. Initially, I planned to use native Cytoscape to build the plugin from scratch, but the workload was too heavy, and there were many mature flow chart plugins in the open-source community that could be referenced. Therefore, I chose XYFlow as the component for flow visualization and integrated it into the plugin. In August, I further optimized the YAML visual editor and fixed some bugs.
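
To make the idea behind the visual editor concrete, here is a minimal sketch of how a parsed Beam YAML pipeline could be turned into the nodes and edges a flow chart component needs. It is written in Python for illustration only and is not the extension's actual code; the function name and the node and edge field names are assumptions, and it handles only transforms whose `input` is a single name or a list of names.

```python
def pipeline_to_graph(spec):
    """Turn a parsed Beam YAML pipeline into (nodes, edges) for a flow chart.

    Simplified sketch: handles only transforms whose `input` is a single
    name or a list of names.
    """
    nodes, edges = [], []
    for index, transform in enumerate(spec["pipeline"]["transforms"]):
        # Fall back to the transform type when no explicit name is given.
        name = transform.get("name", transform["type"])
        nodes.append({"id": name, "label": transform["type"], "order": index})
        inputs = transform.get("input", [])
        if isinstance(inputs, str):
            inputs = [inputs]
        for source in inputs:
            edges.append({"source": source, "target": name})
    return nodes, edges


if __name__ == "__main__":
    # The dict below is what a YAML parser would return for a small pipeline.
    spec = {
        "pipeline": {
            "transforms": [
                {"type": "ReadFromCsv", "name": "Read", "config": {"path": "in.csv"}},
                {"type": "Filter", "name": "Keep", "input": "Read"},
                {"type": "WriteToJson", "name": "Write", "input": "Keep"},
            ]
        }
    }
    nodes, edges = pipeline_to_graph(spec)
    print([node["id"] for node in nodes])
    print(edges)
```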

In September, I completed the project submission, passed Google’s review, and successfully concluded the project.

Development Conclusion

Overall, collaborating with Apache Beam’s developers was a very enjoyable process. I learned a lot about Beam, and since I am a student engaged in high-performance geographic computing research, Beam may play a significant role in my future studies and work.

I am excited to remain an active member of the Beam community. I hope to continue contributing to its development, applying what I have learned to both my academic pursuits and future collaborative projects. The experience has strengthened my commitment to open-source innovation and has set a strong foundation for ongoing participation in Apache Beam and related technologies.

Special Thanks

I would like to express my sincere gratitude to my mentor XQ Hu for his guidance and support throughout the project. Without his help, I would not have been able to complete this project successfully. His professionalism, patience, and passion have been truly inspiring. As a Google employee, he consistently dedicated time each week to the open-source community and willingly assisted students like me. His selfless dedication to open source is something I deeply admire and strive to emulate. He is also an exceptionally devoted teacher who not only imparted technical knowledge but also taught me how to communicate more effectively, handle interpersonal relationships, and collaborate better in a team setting. He always patiently addressed my questions and provided invaluable advice. I am immensely grateful to him and hope to have the opportunity to work with him again in the future.

I also want to thank the Apache Beam community for their valuable feedback and suggestions, which have greatly contributed to the improvement of the plugin. I feel incredibly fortunate that we, as a society, have open-source communities where individuals contribute their intellect and time to drive collective technological progress and innovation. These communities provide students like me with invaluable opportunities to grow and develop rapidly.

Finally, I would like to thank the Google Summer of Code program for providing me with this opportunity to contribute to open-source projects and gain valuable experience. Without Google Summer of Code, I might never have had the chance to engage with so many open-source projects, take that first step into the open-source community, or experience such substantial personal and professional growth.

Google Summer of Code 2025 - Beam ML Vector DB/Feature Store integrations

Fri, 26 Sep 2025 00:00:00 -0400

What Will I Cover In This Blog Post?

I have three objectives in mind when writing this blog post:

Documenting the work I’ve been doing during this GSoC period in collaboration with the Apache Beam community
A thoughtful and cumulative thank you to my mentor and the Beam Community
Writing to an older version of myself before making my first ever contribution to Beam. This can be helpful for future contributors

What Was This GSoC Project About?

The goal of this project is to enhance Beam’s Python SDK by developing connectors for vector databases like Milvus and feature stores like Tecton. These integrations will improve support for ML use cases such as Retrieval-Augmented Generation (RAG) and feature engineering. By bridging Beam with these systems, this project will attract more users, particularly in the ML community.

Why Was This Project Important?

While Beam’s Python SDK supports some vector databases, feature stores and embedding generators, the current integrations are limited to a few systems as mentioned in the tables down below. Expanding this ecosystem will provide more flexibility and richness for ML workflows particularly in feature engineering and RAG applications, potentially attracting more users, particularly in the ML community.

Vector Database	Feature Store	Embedding Generator
BigQuery	Vertex AI	Vertex AI
AlloyDB	Feast	Hugging Face

Why Did I Choose Beam As Part of GSoC Among 180+ Orgs?

I chose to apply to Beam from among 180+ GSoC organizations because it aligns well with my passion for data processing systems that serve information retrieval systems and my core career values:

Freedom: Working on Beam supports open-source development, liberating developers from vendor lock-in through its unified programming model while enabling services like Project Shield to protect free speech globally
Innovation: Working on Beam allows engagement with cutting-edge data processing techniques and distributed computing paradigms
Accessibility: Working on Beam helps build open-source technology that makes powerful data processing capabilities available to all organizations regardless of size or resources. This accessibility enables projects like Project Shield to provide free protection to media, elections, and human rights websites worldwide

What Did I Work On During the GSoC Program?

During my GSoC program, I focused on developing connectors for vector databases, feature stores, and embedding generators to enhance Beam’s ML capabilities. Here are the artifacts I worked on and what remains to be done:

Type	System	Artifact
Enrichment Handler	Milvus	PR #35216 PR #35577 PR #35467
Sink I/O	Milvus	PR #35708 PR #35944
Enrichment Handler	Tecton	PR #36062
Sink I/O	Tecton	PR #36078
Embedding Gen	OpenAI	PR #36081
Embedding Gen	Anthropic	To Be Added

Here are side-artifacts that are not directly linked to my project:

Type	System	Artifact
AI Code Review	Gemini Code Assist	PR #35532
Enrichment Handler	CloudSQL	PR #34398 PR #35473
Pytest Markers	GitHub CI	PR #35655 PR #35740 PR #35816

For more granular contributions, check out my ongoing Beam contributions.

How Did I Approach This Project?

My approach centered on community-driven design and iterative implementation, originally inspired by my mentor’s work. Here’s how it looked:

Design Document: Created a comprehensive design document outlining the proposed ML connector architecture
Community Feedback: Shared the design with the Beam developer community mailing list for review
Iterative Implementation: Incorporated community feedback and applied learnings in subsequent pull requests
Continuous Improvement: Refined the approach based on real-world usage patterns and maintainer guidance

Here are some samples of those design docs:

Component	Type	Design Document
Milvus	Vector Enrichment Handler	[Proposal][GSoC 2025] Milvus Vector Enrichment Handler for Beam
Milvus	Vector Sink I/O Connector	[Proposal][GSoC 2025] Milvus Vector Sink I/O Connector for Beam
Tecton	Feature Store Enrichment Handler	[Proposal][GSoC 2025] Tecton Feature Store Enrichment Handler for Beam
Tecton	Feature Store Sink I/O Connector	[Proposal][GSoC 2025] Tecton Feature Store Sink I/O Connector for Beam

Where Did Challenges Arise During The Project?

There were 2 places where challenges arose:

Running Docker TestContainers in Beam Self-Hosted CI Environment: The main challenge was that Beam runs in CI on Ubuntu 20.04, which caused compatibility and connectivity issues with Milvus TestContainers due to the Docker-in-Docker environment. This version compatibility problem led to container startup failures and network connectivity issues. After several rounds of trial and error, I eventually tested with Ubuntu latest (which at the time of writing this blog post is Ubuntu 25.04), and no issues arose
Triggering and Modifying the PostCommit Python Workflows: This challenge magnified the above issue since for every experiment update to the given workflow, I had to do a round trip to my mentor to include those changes in the relevant workflow files and evaluate the results. I also wasn’t aware that someone can trigger post-commit Python workflows by updating the trigger files in .github/trigger_files until near the middle of GSoC. I discovered there is actually a workflows README document in .github/workflows/README.md that was not referenced in the CONTRIBUTING.md file at the time of writing this post
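
For future contributors, the trigger-file approach mentioned above can be sketched as follows. This is a minimal illustration, assuming a trigger file is a small JSON object with an integer `modification` field; the file name used in the demo is hypothetical, and any trivial change to the real file under .github/trigger_files would serve the same purpose.

```python
import json
import tempfile
from pathlib import Path


def bump_trigger_file(path):
    """Increment the `modification` counter in a JSON trigger file.

    Assumption: the file holds a JSON object with an integer `modification`
    field; any trivial change to the file would serve the same purpose.
    """
    path = Path(path)
    data = json.loads(path.read_text())
    data["modification"] = int(data.get("modification", 0)) + 1
    path.write_text(json.dumps(data, indent=2) + "\n")
    return data["modification"]


if __name__ == "__main__":
    # Demonstrate on a temporary copy rather than a real repository file.
    with tempfile.TemporaryDirectory() as tmp:
        trigger = Path(tmp) / "beam_PostCommit_Python.json"
        trigger.write_text(json.dumps({"comment": "demo", "modification": 1}))
        print(bump_trigger_file(trigger))
```

Committing the changed file as part of a PR is what causes the corresponding workflow to run.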

How Did This Project Start To Attract Users in the ML Community?

We observed that once the Milvus Enrichment Handler PR was open, even before it was merged, community-driven contributions started to appear, like this one that adds Qdrant. Qdrant is a competitor to Milvus in the vector database space. This demonstrates how the project’s momentum and visibility in the ML community attracted contributors who wanted to expand the Beam ML ecosystem with additional vector database integrations.

How Did This GSoC Experience Working With Beam Community Shape Me?

If I have to boil it down across three dimensions, they would be:

Mindset: Before I was probably working in solitude making PRs about new integrations with mental chatter in the form of fingers crossed, hoping that there will be no divergence on the design. Now I can engage people I am working with through design docs, making sure my work aligns with their vision, which potentially leads to faster PR merges
Skillset: It had been a year since I last wrote Python professionally before contributing to Beam, so it was a great opportunity to brush up on my Python skills and see how some design patterns are used in practice, like the query builder pattern seen in CloudSQL Vector Ingestion in the RAG package. I also learned about vector databases and feature stores, and also some AI integrations. I also think I got a bit better than before at root cause analysis and filtering signal from noise in long log files like PostCommit Python workflows
Toolset: Learning about Beam Python SDK, Milvus, Tecton, Google CloudSQL, OpenAI and Anthropic text embedding generators, and lnav for effective log file navigation, including their capabilities and limitations

Tips for Future Contributors

If I have to boil them down to three, they would be:

Observing: Observing how experienced developers in the Beam dev team work: how their PRs look, how they write design docs, what kind of feedback they get on their design docs and PRs, and how you can apply it (if feasible) to avoid getting the same feedback again. What kind of follow-up PRs do they create after their initial ones? How do they document and illustrate their work? What kind of comments do they post when reviewing other people’s related work? Over time, you build your own mental model and knowledge base on how the ideal contribution looks in this area. There is a lot to learn and explore in an exciting, not intimidating way
Orienting: Understanding your place in the ecosystem and aligning your work with the project’s context. This means grasping how your contribution fits into Beam’s architecture and roadmap, identifying your role in addressing current gaps, and mapping stakeholders who will review, use, and maintain your work. Most importantly, align with both your mentor’s vision and the community’s vision to ensure your work serves the broader goals
Acting: Acting on feedback from code reviews, design document discussions, and community input. This means thoughtfully addressing suggested changes in a way that moves the discussion forward, addressing concerns raised by maintainers, and iterating on your work based on community guidance. Be responsive to feedback, ask clarifying questions when needed, and demonstrate that you’re incorporating the community’s input into your contributions, provided that it is aligned with the project direction

Who Do I Want To Thank for Making This Journey Possible?

If I have to boil them down to three, they would be:

My Mentor, Danny McCormick: I wouldn’t hesitate to say that Danny is the best mentor I have worked with so far, given that I have worked with several mentors. What makes me say that:
- Generosity: Danny is very generous with his time and feedback, and genuinely committed to reviewing my work on a regular basis. We had weekly 30-minute sync calls for almost 21 weeks (5 months) since the official community bonding period, where he shared his contextual expertise and addressed any questions I had, with openness to extend the time if needed and flexibility about skipping calls when there was no agenda
- Flexibility: When I got accepted to GSoC, after a few days I also got accepted to a part-time internship that I had applied to before GSoC, while also managing my last semester in my Bachelor of Computer Science, which was probably the hardest semester. During our discussion about working capacity, Danny was very flexible regarding that, with more emphasis on making progress, which encouraged me to make even more progress. I have also never felt there are very hard boundaries around my project scope; I felt there was an area to explore that motivated me to think of and add some side-artifacts to Beam, e.g., adding Gemini Code Assist for AI code review
- Proactivity: Danny was very proactive in offering support and help without originally asking, e.g., making Beam Infra tickets that add API keys to unblock my work
Beam Community: From my first ever contribution to Beam adding FlattenWith and Tee examples to the playground, I was welcomed with open arms and felt encouraged to make more contributions. I am also grateful for their valuable comments on my design documents on the dev mailing list as well as on the PRs
Google: I would like to genuinely thank Google for introducing me to open source in GSoC 2023 and giving me a second chance to interact with Apache Beam through GSoC 2025. Without it, I probably wouldn’t be here writing this blog post, nor would I have this fruitful experience

What’s Next?

I am now focusing on helping move the remaining artifacts in this project scope from the in-progress state to the merging state. After this, I would love to keep my contributions alive in Beam Python and Go SDK, to name a few. I would also love to connect with you all on my LinkedIn and GitHub.

Google Summer of Code 2025 - Beam YAML, Kafka and Iceberg User Accessibility

Tue, 23 Sep 2025 00:00:00 -0400

The relatively new Beam YAML SDK was introduced in the spirit of making data processing easy, but it has gained little adoption for complex ML tasks and hasn’t been widely used with Managed I/O such as Kafka and Iceberg. As part of Google Summer of Code 2025, new illustrative, production-ready pipeline examples of ML use cases with Kafka and Iceberg data sources using the YAML SDK have been developed to address this adoption gap.

Context

The YAML SDK was introduced in Spring 2024 as Beam’s first no-code SDK. It follows a declarative approach of defining a data processing pipeline using a YAML DSL, as opposed to other programming language specific SDKs. At the time, it had few meaningful examples and documentation to go along with it. Key missing examples were ML workflows and integration with the Kafka and Iceberg Managed I/O. Foundational work had already been done to add support for ML capabilities as well as Kafka and Iceberg IO connectors in the YAML SDK, but there were no end-to-end examples demonstrating their usage.

Beam, as well as Kafka and Iceberg, are mainstream big data technologies but they also have a learning curve. The overall theme of the project is to help democratize data processing for scientists and analysts who traditionally don’t have a strong background in software engineering. They can now refer to these meaningful examples as the starting point, helping them onboard faster and be more productive when authoring ML/data pipelines for their use cases with Beam and its YAML DSL.

Contributions

The data pipelines/workflows developed are production-ready: Kafka and Iceberg data sources are set up on GCP, and the data used are raw public datasets. The pipelines are tested end-to-end on Google Cloud Dataflow and are also unit tested to ensure correct transformation logic.

Delivered pipelines/workflows, each with documentation as README.md, address 4 main ML use cases below:

Streaming Classification Inference: A streaming ML pipeline that demonstrates Beam YAML capability to perform classification inference on a stream of incoming data from Kafka. The overall workflow also includes DistilBERT model deployment and serving on Google Cloud Vertex AI where the pipeline can access for remote inferences. The pipeline is applied to a sentiment analysis task on a stream of YouTube comments, preprocessing data and classifying whether a comment is positive or negative. See pipeline and documentation.
Streaming Regression Inference: A streaming ML pipeline that demonstrates Beam YAML capability to perform regression inference on a stream of incoming data from Kafka. The overall workflow also includes custom model training, deployment and serving on Google Cloud Vertex AI where the pipeline can access for remote inferences. The pipeline is applied to a regression task on a stream of taxi rides, preprocessing data and predicting the fare amount for every ride. See pipeline and documentation.
Batch Anomaly Detection: An ML workflow that demonstrates ML-specific transformations and reading from/writing to Iceberg IO. The workflow contains unsupervised model training and several pipelines that leverage Iceberg for storing results, BigQuery for storing vector embeddings and MLTransform for computing embeddings to demonstrate an end-to-end anomaly detection workflow on a dataset of system logs. See workflow and documentation.
Feature Engineering & Model Evaluation: An ML workflow that demonstrates Beam YAML capability to do feature engineering which is subsequently used for model evaluation, and its integration with Iceberg IO. The workflow contains model training and several pipelines, showcasing an end-to-end Fraud Detection MLOps solution that generates features and evaluates models to detect fraudulent credit card transactions. See workflow and documentation.

Challenges

The main challenge of the project was a lack of previous YAML pipeline examples and good documentation to rely on. Unlike the Python or Java SDKs where there are already many notebooks and end-to-end examples demonstrating various use cases, the examples for YAML SDK only involved simple transformations such as filter, group by, etc. More complex transforms like MLTransform and ReadFromIceberg had no examples and required configurations that didn’t have a clear API reference at the time. As a result, there were a lot of deep dives into the actual implementation of the PTransforms across YAML, Python and Java SDKs to understand the error messages and how to correctly use the transforms.

Another challenge was writing unit tests for the pipeline to ensure that the pipeline’s logic is correct. It was a learning curve to understand how the existing test suite is set up and how it can be used to write unit tests for the data pipelines. A lot of time was spent on properly writing mocks for the pipeline’s sources and sinks, as well as for the transforms that require external services such as Vertex AI.
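
The general idea of mocking an external service in a unit test can be sketched with the Python standard library alone. This is not the project's actual test code: `classify_comment` and its `predict` argument are hypothetical stand-ins for a transformation step that calls a remote model endpoint.

```python
from unittest import mock


def classify_comment(comment, predict):
    """Label a comment using a remote model exposed through `predict`.

    `predict` stands in for a call to an external service; it is expected
    to return a score between 0 and 1.
    """
    text = comment.strip().lower()
    if not text:
        return {"comment": comment, "label": "unknown"}
    score = predict(text)
    label = "positive" if score >= 0.5 else "negative"
    return {"comment": comment, "label": label}


def test_classify_comment_uses_mocked_service():
    # The mock replaces the remote endpoint, so no network access is needed.
    fake_predict = mock.Mock(return_value=0.9)
    result = classify_comment("  Great video!  ", fake_predict)
    assert result["label"] == "positive"
    # The service should receive the preprocessed text, not the raw input.
    fake_predict.assert_called_once_with("great video!")


if __name__ == "__main__":
    test_classify_comment_uses_mocked_service()
    print("ok")
```

Passing the service call in as an argument keeps the transformation logic testable without credentials or network access.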

Conclusion & Personal Thoughts

These production-ready pipelines demonstrate the potential of Beam YAML SDK to author complex ML workflows that interact with Iceberg and Kafka. The examples are a nice addition to Beam, especially with Beam 3.0.0 milestones coming up where low-code/no-code, ML capabilities and Managed I/O are focused on.

I had an amazing time working with the big data technologies Beam, Iceberg, and Kafka as well as many Google Cloud services (Dataflow, Vertex AI and Google Kubernetes Engine, to name a few). I’ve always wanted to work more in the ML space, and this experience has been a great growth opportunity for me. Google Summer of Code this year has been selective, and the project’s success would not have been possible without the support of my mentor, Chamikara Jayalath. It’s been a pleasure working closely with him and the broader Beam community to contribute to this open-source project that has a meaningful impact on the data engineering community.

My advice for future Google Summer of Code participants is to first and foremost research and choose a project that aligns closely with your interest. Most importantly, spend a lot of time making yourself visible and writing a good proposal when the program is opened for applications. Being visible (e.g. by sharing your proposal, or generally any ideas and questions on the project’s communication channel early on) makes it more likely for you to be selected; and a good proposal not only will make you even more likely to be in the program, but also give you a lot of confidence when contributing to and completing the project.

Apache Beam 2.68.0

Mon, 22 Sep 2025 15:00:00 -0500

We are happy to present the new 2.68.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.68.0, check out the detailed release notes.

Highlights

[Python] Prism runner now enabled by default for most Python pipelines using the direct runner (#34612). This may break some tests; see https://github.com/apache/beam/pull/34612 for details on how to handle issues.

I/Os

Upgraded Iceberg dependency to 1.9.2 (#35981)

New Features / Improvements

BigtableRead Connector for BeamYaml added with new Config Param (#35696)
MongoDB Java driver upgraded from 3.12.11 to 5.5.0 with API refactoring and GridFS implementation updates (Java) (#35946).
Introduced a dedicated module for JUnit-based testing support: sdks/java/testing/junit, which provides TestPipelineExtension for JUnit 5 while maintaining backward compatibility with existing JUnit 4 TestRule-based tests (Java) (#18733, #35688).
- To use JUnit 5 with Beam tests, add a test-scoped dependency on org.apache.beam:beam-sdks-java-testing-junit.
Google CloudSQL enrichment handler added (Python) (#34398). Beam now supports data enrichment capabilities using SQL databases, with built-in support for:
- Managed PostgreSQL, MySQL, and Microsoft SQL Server instances on CloudSQL
- Unmanaged SQL database instances not hosted on CloudSQL (e.g., self-hosted or on-premises databases)
[Python] Added the ReactiveThrottler and ThrottlingSignaler classes to streamline throttling behavior in DoFns and expose throttling mechanisms for users (#35984).
Added a pipeline option to specify the processing timeout for a single element by any PTransform (Java/Python/Go) (#35174).
- When specified, the SDK harness automatically restarts if an element takes too long to process. Beam runner may then retry processing of the same work item.
- Use the --element_processing_timeout_minutes option to reduce the chance of having stalled pipelines due to unexpected cases of slow processing, where slowness might not happen again if processing of the same element is retried.
(Python) Added GCP Spanner Change Stream support for Python (apache_beam.io.gcp.spanner) (#24103).

Breaking Changes

Previously deprecated Beam ZetaSQL component has been removed (#34423). ZetaSQL users could migrate to Calcite SQL with BigQuery dialect enabled.
Upgraded Beam vendored Calcite to 1.40.0 for Beam SQL (#35483), which improves support for BigQuery and other SQL dialects. Note: Minor behavior changes are observed such as output significant digits related to casting.
(Python) The deterministic fallback coder for complex types like NamedTuple, Enum, and dataclasses now uses cloudpickle instead of dill. If your pipeline is affected, you may see a warning like: “Using fallback deterministic coder for type X…”. You can revert to the previous behavior by using the pipeline option --update_compatibility_version=2.67.0 (#35725). Report any pickling-related issues to #34903.
(Python) Prism runner now enabled by default for most Python pipelines using the direct runner (#34612). This may break some tests; see https://github.com/apache/beam/pull/34612 for details on how to handle issues.
Dropped Java 8 support for IO expansion-service. Cross-language pipelines using this expansion service will need a Java 11+ runtime (#35981).

Deprecations

Python SDK native SpannerIO (apache_beam/io/gcp/experimental/spannerio) is deprecated. Use cross-language wrapper (apache_beam/io/gcp/spanner) instead (Python) (#35860).
Samza runner is deprecated and scheduled for removal in Beam 3.0 (#35448).
Twister2 runner is deprecated and scheduled for removal in Beam 3.0 (#35905).

Bugfixes

(Python) Fixed Java YAML provider failing on Windows (#35617).
Fixed BigQueryIO creating temporary datasets in wrong project when temp_dataset is specified with a different project than the pipeline project. For some jobs, temporary datasets will now be created in the correct project (Python) (#35813).
(Go) Fix duplicates due to reads after blind writes to Bag State (#35869).
- Earlier Go SDK versions can avoid the issue by not reading in the same call after a blind write.

List of Contributors

According to git shortlog, the following people contributed to the 2.68.0 release. Thank you to all contributors!

Ahmed Abualsaud, Andrew Crites, Ashok Devireddy, Chamikara Jayalath, Charles Nguyen, Danny McCormick, Davda James, Derrick Williams, Diego Hernandez, Dip Patel, Dustin Rhodes, Enrique Calderon, Hai Joey Tran, Jack McCluskey, Kenneth Knowles, Keshav, Khorbaladze A., LEEKYE, Lanny Boarts, Mattie Fu, Minbo Bae, Mohamed Awnallah, Naireen Hussain, Nathaniel Young, Radosław Stankiewicz, Razvan Culea, Robert Bradshaw, Robert Burke, Sam Whittle, Shehab, Shingo Furuyama, Shunping Huang, Steven van Rossum, Suvrat Acharya, Svetak Sundhar, Tarun Annapareddy, Tom Stepp, Valentyn Tymofieiev, Vitaly Terentyev, XQ Hu, Yi Hu, apanich, arnavarora2004, claudevdm, flpablo, kristynsmith, shreyakhajanchi

Google Summer of Code 25 - Improving Apache Beam's Infrastructure

Mon, 15 Sep 2025 00:00:00 -0600

I loved contributing to Apache Beam during Google Summer of Code 2025. I worked on improving the infrastructure of Apache Beam, which included enhancing the CI/CD pipelines, automating various tasks, and improving the overall developer experience.

Motivation

Since I was in high school, I have been fascinated by computers, but when I discovered Open Source, I was amazed by the idea of people from all around the world collaborating to build software that anyone can use, just for the love of it. I started participating in open source communities, and I found it to be a great way to learn and grow as a developer.

When I heard about Google Summer of Code, I saw it as an opportunity to take my open source contributions to the next level. The idea of working on a real-world project while being mentored by experienced developers sounded like an amazing opportunity. I heard about Apache Beam from another contributor and ex-GSoC participant, and I was immediately drawn to the project, specifically on the infrastructure side of things, as I have a strong interest in DevOps and automation.

The Challenge

When searching for a project, I was told that Apache Beam’s infrastructure had several areas that could be improved. I was excited because the ideas were focused on improving the developer experience, and creating tools that could benefit not only Beam’s developers but also the wider open source community.

There were four main challenges:

Automating the cleanup of unused cloud resources to reduce costs and improve resource management.
Implementing a system for managing permissions through Git, allowing for better tracking and auditing of changes.
Creating a tool for rotating service account keys to enhance security.
Developing a security monitoring system to detect and respond to potential threats.

The Solution

I worked closely with my mentor to break down and define each challenge into manageable tasks, creating a plan for the summer. I started by taking a look at the current state of the infrastructure, after which I began working on each challenge one by one.

Automating the cleanup of unused cloud resources: We noticed that some resources in the GCP project, especially Pub/Sub topics created for testing, were often forgotten, leading to unnecessary costs. Since the infrastructure is primarily for testing and development, there’s no need to keep unused resources. I developed a Python script that identifies and removes stale Pub/Sub topics that have existed for too long. This tool is now scheduled to run periodically via a GitHub Actions workflow to keep the project tidy and cost-effective.
Implementing a system for managing permissions through Git: This was more challenging, as it required a good understanding of both GCP IAM and the existing workflow. After some investigation, I learned that the current process was mostly manual and error-prone. The task involved creating a more automated and reliable system. This was achieved by using Terraform to define the desired state of IAM roles and permissions in code, which allows for better tracking and auditing of changes. This also included some custom roles, but that is still a work in progress.
Creating a tool for rotating service account keys: Key rotation is a security practice that we don’t always follow, but it is essential to ensure that service account keys are not compromised. I noticed that GCP had some APIs that could help with this, but the rotation process itself was not automated. So I wrote a Python script that automates the rotation of GCP service account keys, enhancing the security of service account credentials.
Developing a security monitoring system: To keep track of incorrect usage and potential threats, I built a log analysis tool that monitors GCP audit logs for suspicious activity. It collects and parses the logs to identify potential security threats and sends email alerts when something unusual is detected.
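
The stale-resource cleanup described in the first item can be pictured with a small sketch. This is not the script from the project: the `Topic` record, its `created_at` field, the topic names, and the seven-day threshold are all assumptions made for illustration, and a real tool would list and delete topics through the Pub/Sub API rather than filter an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Topic:
    """Hypothetical record: a topic name plus the time it was first seen."""
    name: str
    created_at: datetime


def find_stale_topics(topics, now, max_age):
    """Return the names of topics older than max_age, oldest first."""
    stale = [t for t in topics if now - t.created_at > max_age]
    return [t.name for t in sorted(stale, key=lambda t: t.created_at)]


# Made-up inventory; a real script would fetch this from the cloud project.
now = datetime(2025, 9, 15, tzinfo=timezone.utc)
topics = [
    Topic("projects/demo/topics/it-test-old", now - timedelta(days=30)),
    Topic("projects/demo/topics/it-test-new", now - timedelta(hours=2)),
    Topic("projects/demo/topics/it-test-older", now - timedelta(days=90)),
]

# Assumed policy: anything older than seven days is a cleanup candidate.
stale = find_stale_topics(topics, now, max_age=timedelta(days=7))
for name in stale:
    print("would delete:", name)
```

Keeping the selection logic in a pure function like this makes it easy to test without touching any cloud resources; the scheduled workflow then only has to supply the inventory and act on the result.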

As an extra, and after noticing that some of these tools and policies could be ignored by developers, we also came up with the idea of an enforcement module to ensure the usage of these new tools and policies. This module would be integrated into the CI/CD pipeline, checking for compliance with the new infrastructure policies and notifying developers of any violations.

The Impact

The tools developed during this project will have an impact on the Apache Beam community and the wider open source community. The automation of resource cleanup will help reduce costs and improve resource management, while the permission management system will provide better tracking and auditing of changes. The service account key rotation tool will enhance security, and the security monitoring system will help detect and respond to potential threats.

Wrap Up

This project has been an incredible learning experience for me. I have gained a better understanding of how GCP works, as well as how to use Terraform and GitHub Actions. I have also learned a lot about security best practices and how to implement them in a real-world project.

I also learned a lot about working in an open source community, having direct communication with such experienced developers, and the importance of collaboration and communication in a distributed team. I am grateful for the opportunity to work on such an important project and to contribute to the Apache Beam community.

Finally, a special thanks to my mentor, Pablo Estrada, for his guidance and support throughout the summer. I am grateful not only for his amazing technical skills but especially for his patience and encouragement on my journey contributing to open source.

You can find my final report here if you want to take a look at the details of my work.

Advice for Future Participants

If you are considering participating in Google Summer of Code, my advice would be to choose an area you are passionate about; this will make any coding challenge easier to overcome. Also, don’t be afraid to ask questions and seek help from your mentors and the community. At the start, I made that mistake, and I learned that asking for help is a sign of strength, not weakness.

Finally, make sure to manage your time effectively and stay organized (keeping a progress journal is a great idea). GSoC is a great opportunity to learn and grow as a developer, but it can also be time-consuming, so it’s important to stay focused and on track.

Apache Beam 2.67.0

Tue, 12 Aug 2025 15:00:00 -0500

We are happy to present the new 2.67.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.67.0, check out the detailed release notes.

Highlights

I/Os

Debezium IO upgraded to 3.1.1, which requires Java 17 (Java) (#34747).
Add support for streaming writes in IOBase (Python)
Implement support for streaming writes in FileBasedSink (Python)
Expose support for streaming writes in TextIO (Python)

New Features / Improvements

Added support for Processing time Timer in the Spark Classic runner (#33633).
Add pip-based install support for JupyterLab Sidepanel extension (#35397).
[IcebergIO] Create tables with specified table properties (#35496)
Add support for comma-separated options in Python SDK (Python) (#35580). Python SDK now supports comma-separated values for experiments and dataflow_service_options, matching Java SDK behavior while maintaining backward compatibility.
Milvus enrichment handler added (Python) (#35216). Beam now supports Milvus enrichment handler capabilities for vector, keyword, and hybrid search operations.
[Beam SQL] Add support for DATABASEs, with an implementation for Iceberg (#35637)
Respect BatchSize and MaxBufferingDuration when using JdbcIO.WriteWithResults. Previously, these settings were ignored (#35669).
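
To illustrate the comma-separated option behavior mentioned above, here is a minimal standard-library sketch of the idea: a repeated flag and a single comma-separated flag end up as the same list. It is not the Beam SDK's implementation; the helper `flatten_comma_separated` and the experiment names `exp_a` and `exp_b` are hypothetical placeholders.

```python
import argparse


def flatten_comma_separated(values):
    """Split each raw entry on commas and drop empty pieces."""
    flat = []
    for value in values or []:
        flat.extend(part.strip() for part in value.split(",") if part.strip())
    return flat


parser = argparse.ArgumentParser()
# Each occurrence of a flag appends one raw string to the list.
parser.add_argument("--experiments", action="append", default=[])
parser.add_argument("--dataflow_service_options", action="append", default=[])

# Placeholder experiment names, not real Beam experiments.
repeated = parser.parse_args(["--experiments=exp_a", "--experiments=exp_b"])
combined = parser.parse_args(["--experiments=exp_a,exp_b"])

# Both spellings normalize to the same list of experiments.
print(flatten_comma_separated(repeated.experiments))
print(flatten_comma_separated(combined.experiments))
```

Because the repeated-flag form still passes through unchanged, pipelines that already use one flag per value keep working.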

Breaking Changes

Go: The pubsubio.Read transform now accepts ReadOptions as a value type instead of a pointer, and requires exactly one of Topic or Subscription to be set (they are mutually exclusive). Additionally, the ReadOptions struct now includes a Topic field for specifying the topic directly, replacing the previous topic parameter in the Read function signature (#35369).
SQL: The ParquetTable external table provider has changed its handling of the LOCATION property. To read from a directory, the path must now end with a trailing slash (e.g., LOCATION '/path/to/data/'). Previously, a trailing slash was not required. This change was made to enable support for glob patterns and single-file paths (#35582).
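
One way to picture why the trailing slash matters for the ParquetTable LOCATION change: once single-file paths and glob patterns are accepted, a path without a trailing slash can no longer be assumed to be a directory. The sketch below is only an illustrative rule and not the provider's actual logic; `classify_location` and the set of glob characters are assumptions.

```python
def classify_location(location):
    """Classify a LOCATION string as 'glob', 'directory', or 'file'.

    Illustrative rule only: glob metacharacters win, then a trailing
    slash marks a directory, and anything else is a single file.
    """
    if any(ch in location for ch in "*?["):
        return "glob"
    if location.endswith("/"):
        return "directory"
    return "file"


for loc in [
    "/path/to/data/",
    "/path/to/data/*.parquet",
    "/path/to/data/part-0.parquet",
    "/path/to/data",  # no trailing slash, so no longer read as a directory
]:
    print(loc, "->", classify_location(loc))
```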

Bugfixes

[YAML] Fixed handling of missing optional fields in JSON parsing (#35179).
[Python] Fixed WriteToBigQuery transform using CopyJob not working with WRITE_TRUNCATE write disposition (#34247).
[Python] Fixed dicomio tags mismatch in integration tests (#30760).
[Java] Fixed spammy logging issues that affected versions 2.64.0 to 2.66.0.

Known Issues

(#35666). YAML Flatten incorrectly drops fields when the input PCollections’ schemas are different. This issue exists for all versions since 2.52.0.

List of Contributors

According to git shortlog, the following people contributed to the 2.67.0 release. Thank you to all contributors!

Aditya Shukla, Ahmed Abualsaud, Arun Pandian, Boris Li, Chamikara Jayalath, Charles Nguyen, Chenzo, Danny McCormick, David Adeniji, Derrick Williams, Dmytro Tsyliuryk, Dustin Rhodes, Enrique Calderon, Gottipati Gautam, Hai Joey Tran, Hunor Portik, Jack McCluskey, Kenneth Knowles, Khorbaladze A., Marcio Sugar, Minh Son Nguyen, Mohamed Awnallah, Nathaniel Young, Nhon Dinh, Quentin Sommer, Rafael Raposo, Rakesh Kumar, Razvan Culea, Reuven Lax, Robert Bradshaw, Sam Whittle, Shunping Huang, Steven van Rossum, Talat UYARER, Tanu Sharma, Tarun Annapareddy, Tobi Kaymak, Tobias Kaymak, Valentyn Tymofieiev, Veronica Wasson, Vitaly Terentyev, XQ Hu, Yi Hu, akashorabek, arnavarora2004, changliiu, claudevdm, fozzie15, mvhensbergen, twosom

Our Experience at Beam College 2025: 1st Place Hackathon Winners

Tue, 08 Jul 2025 00:00:00 +0000

Introduction: The Beam of an Idea

In the world of machine learning for healthcare, preprocessing large pathology image datasets at scale remains a bottleneck. Whole Slide Images (WSIs) in medical imaging can reach massive sizes. Traditional Python tools (PIL, etc.) fail under memory pressure, especially when handling thousands of such high-resolution images, which holds back the ML modeling tasks that depend on them.

Having previously worked on image processing for object detection in machine learning, we also understood how crucial it is to preprocess and structure image data correctly for downstream tasks. These challenges are non-trivial and even more critical in healthcare, making it a natural and high-impact use case for scalable data processing frameworks like Apache Beam.

So, in the Beam Summit 2025 Hackathon, we joined as team “PCollectors” with the goal of leveraging Beam to process large image data and convert it to a format suitable for downstream ML tasks. We were amazed to learn that we secured 1st place with the implemented solution!

The Project: Scalable WSI Preprocessing Beam Pipeline

GitHub Repo

The Goal

The primary objective of the pipeline was to process patient data (CSV) & WSIs, extract embeddings, combine the metadata, and output the final dataset in TFRecord format, ready for large-scale ML training.

Solution Overview

Our pipeline processes:

Patient metadata (CSV)
WSI files (.tif)
Split the images into “tiles”
Extract filtered image tiles based on the background threshold
Generate max & avg embeddings per patient using EfficientNet
Merge metadata + embeddings into TFRecords

All in a scalable, memory-efficient, cloud-native pipeline using Apache Beam and Dataflow.
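
The tiling, background filtering, and per-patient pooling steps listed above can be sketched in plain Python. This is a toy illustration rather than the team's pipeline: the real project reads slides with pyvips and openslide and computes embeddings with EfficientNet, while the function names, the grayscale nested-list tiles, and the threshold values (`white_level`, `max_fraction`) here are assumptions.

```python
def tile_origins(width, height, tile_size):
    """Yield (x, y) top-left corners of full, non-overlapping tiles."""
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            yield (x, y)


def is_mostly_background(tile, white_level=220, max_fraction=0.8):
    """True when more than max_fraction of grayscale pixels are near-white."""
    pixels = [p for row in tile for p in row]
    bright = sum(1 for p in pixels if p >= white_level)
    return bright / len(pixels) > max_fraction


def pool_embeddings(embeddings):
    """Element-wise max and mean over a list of equal-length vectors."""
    columns = list(zip(*embeddings))
    max_vec = [max(col) for col in columns]
    avg_vec = [sum(col) / len(col) for col in columns]
    return max_vec, avg_vec


# A 5x4 image cut into 2x2 tiles; the partial right-hand column is dropped.
origins = list(tile_origins(width=5, height=4, tile_size=2))

tissue_tile = [[30, 200], [250, 90]]     # mostly dark pixels: keep
blank_tile = [[255, 250], [240, 230]]    # all near-white: filter out
kept = [t for t in (tissue_tile, blank_tile) if not is_mostly_background(t)]

# Stand-in embeddings for two tiles belonging to the same patient.
max_vec, avg_vec = pool_embeddings([[1.0, 4.0], [3.0, 2.0]])
print(origins, len(kept), max_vec, avg_vec)
```

In a Beam pipeline, each of these functions would typically sit behind its own transform, with the pooling step applied after grouping tile embeddings by patient.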

Dataset

Source: Mayo Clinic STRIP AI Dataset (Kaggle)
Metadata: Each row = { image_id, center_id, patient_id, image_num, label }
- Multiple images per patient
- Labels exist only at the patient level
Images: High-res .tif pathology slides

Tech Stack

Apache Beam: Orchestration engine
Google Cloud Dataflow: Scalable runner
Google Cloud Storage: Input TIFFs + output TFRecords
TensorFlow: For embedding generation (EfficientNet) and TFRecord serialization

The Hackathon Journey

Participating in the hackathon introduced us to multiple new things and allowed us to learn and implement simultaneously. Through the hackathon weekend, we:

Designed the end-to-end pipeline
Integrated pyvips + openslide for efficient image loading
Used Beam’s RunInference API with TensorFlow
Tiled and filtered images
Wrote patient-level embeddings to TFRecords

What we Learnt

Apache Beam is really powerful for parallel and cloud-native ML preprocessing. Dataflow is the go-to tool when processing large data, like medical images.

What’s Next for The Project

Looking ahead, the pipeline can be extended beyond fixed-size tiling by incorporating image segmentation techniques to generate more meaningful patches based on tissue regions. This approach can improve ML model performance by focusing only on relevant areas. Moreover, the same preprocessing framework can be adapted for video data, where frames can be treated as time-indexed image slices, effectively enabling temporal modeling for time-series tasks such as motion analysis or progression tracking. Finally, we plan to adapt this pipeline to multiple downstream use cases for AI in healthcare by combining histology images with genomic data, clinical notes, or radiology scans, paving the way for more comprehensive and context-aware models in biomedical machine learning.

Project Submission Demo: Beam Demo - PCollectors.mp4

Conclusion

We are ML Engineers working at Intuitive.Cloud, where we play around with large-scale data to build scalable, efficient, dynamic data processing pipelines that prepare it for downstream ML tasks, with Apache Beam and Google Cloud Dataflow being the central pieces.

Participating in the hackathon was a great learning opportunity. Huge thanks to the organizers, mentors, and the Apache Beam community!

- Aditya Shukla & Darshan Kanade

Apache Beam 2.66.0

Tue, 01 Jul 2025 15:00:00 -0500

We are happy to present the new 2.66.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.66.0, check out the detailed release notes.

Beam 3.0.0 Development Highlights

[Java] Java 8 support is now deprecated. It remains supported until Beam 3. From now on, pipelines submitted by a Java 8 client use the Java 11 SDK container for remote pipeline execution (#35064).

Highlights

[Python] Several quality-of-life improvements to the vLLM model handler. If you use Beam RunInference with vLLM model handlers, we strongly recommend updating past this release.

I/Os

[IcebergIO] Now available with Beam SQL! (#34799)
[IcebergIO] Support reading with column pruning (#34856)
[IcebergIO] Support reading with pushdown filtering (#34827)
[IcebergIO] Create tables with a specified partition spec (#34966, #35268)
[IcebergIO] Dynamically create namespaces if needed (#35228)

New Features / Improvements

[Beam SQL] Introducing Beam Catalogs (#35223)
Added Google Storage Requester Pays feature (Golang) (#30747).
[Python] Prism runner now auto-enabled for some Python pipelines using the direct runner (#34921).
[YAML] WriteToTFRecord and ReadFromTFRecord Beam YAML support
Python: Added JupyterLab 4.x extension compatibility for enhanced notebook integration (#34495).

Breaking Changes

Yapf version upgraded to 0.43.0 for formatting (Python) (#34801).
Python: Added JupyterLab 4.x extension compatibility for enhanced notebook integration (#34495).
Python: Argument abbreviation is no longer enabled within Beam. If you previously abbreviated arguments (e.g. --r for --runner), you will now need to specify the whole argument (#34934).
Java: Users of the ReadFromKafkaViaSDF transform might encounter pipeline graph compatibility issues when updating the pipeline. To mitigate, set the updateCompatibilityVersion option to the SDK version used for the original pipeline, for example --updateCompatibilityVersion=2.64.0.
Python: Updated AlloyDBVectorWriterConfig API to align with new PostgresVectorWriter transform. Here’s a quick guide to update your code: (#35225)

Bugfixes

(Java) Fixed CassandraIO ReadAll not letting a pipeline handle or retry exceptions (#34191).
[Python] Fixed vLLM model handlers breaking Beam logging. (#35053).
[Python] Fixed vLLM connection leaks that caused a throughput bottleneck and underutilization of GPU (#35053).
[Python] Fixed vLLM server recovery mechanism in the event of a process termination (#35234).
(Python) Fixed cloudpickle overwriting class states each time the same object of a dynamic class is loaded (#35062).
[Python] Fixed pip install apache-beam[interactive] causing a crash on Google Colab (#35148).
[IcebergIO] Fixed Beam <-> Iceberg conversion logic for arrays of structs and maps of structs (#35230).

Known Issues

N/A

List of Contributors

According to git shortlog, the following people contributed to the 2.66.0 release. Thank you to all contributors!

Aditya Yadav, Adrian Stoll, Ahmed Abualsaud, Bhargavkonidena, Chamikara Jayalath, Charles Nguyen, Chenzo, Damon, Danny McCormick, Derrick Williams, Enrique Calderon, Hai Joey Tran, Jack McCluskey, Kenneth Knowles, Leonardo Cesar Borges, Michael Gruschke, Minbo Bae, Minh Son Nguyen, Niel Markwick, Radosław Stankiewicz, Rakesh Kumar, Robert Bradshaw, S. Veyrié, Sam Whittle, Shubham Jaiswal, Shunping Huang, Steven van Rossum, Tanu Sharma, Vardhan Thigle, Vitaly Terentyev, XQ Hu, Yi Hu, akashorabek, atask-g, atognolag, bullet03, changliiu, claudevdm, fozzie15, ikarapanca, kristynsmith, Pablo Rodriguez Defino, tvalentyn, twosom, wollowizard