WordCount quickstart for Java

This quickstart shows you how to set up a Java development environment and run an example pipeline written with the Apache Beam Java SDK, using a runner of your choice.

If you’re interested in contributing to the Apache Beam Java codebase, see the Contribution Guide.

Set up your development environment

Download and install the Java Development Kit (JDK) version 8, 11, or 17. Verify that the JAVA_HOME environment variable is set and points to your JDK installation.
Download and install Apache Maven by following the installation guide for your operating system.
Optional: If you want to convert your Maven project to Gradle, install Gradle.
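
As a quick sanity check on the setup above, the following sketch reports whether the JVM you are running matches one of the JDK versions listed here and whether JAVA_HOME is set. The class and method names are hypothetical and not part of Beam; the supported list simply mirrors the versions named in this quickstart.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative helper (not part of Beam): compares the running JVM
// against the JDK versions this quickstart lists.
class JavaVersionCheck {
    // JDK versions named in this quickstart.
    static final List<Integer> SUPPORTED = Arrays.asList(8, 11, 17);

    // Extracts the major version from a java.version string,
    // e.g. "1.8.0_292" -> 8, "11.0.2" -> 11, "17" -> 17.
    static int majorVersion(String version) {
        String[] parts = version.split("[._+-]");
        int first = Integer.parseInt(parts[0]);
        // Java 8 and earlier report their version as "1.x".
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    static boolean isSupported(String version) {
        return SUPPORTED.contains(majorVersion(version));
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        String javaHome = System.getenv("JAVA_HOME");
        System.out.println("java.version = " + version
            + " (listed in this quickstart: " + isSupported(version) + ")");
        System.out.println("JAVA_HOME = " + (javaHome == null ? "<not set>" : javaHome));
    }
}
```

If JAVA_HOME prints as not set, fix that before running the Maven commands that follow.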

Get the example code

Generate a Maven example project that builds against Beam 2.60.0, the release this quickstart targets.

For Unix shells:

mvn archetype:generate \
    -DarchetypeGroupId=org.apache.beam \
    -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
    -DarchetypeVersion=2.60.0 \
    -DgroupId=org.example \
    -DartifactId=word-count-beam \
    -Dversion="0.1" \
    -Dpackage=org.apache.beam.examples \
    -DinteractiveMode=false

For Windows PowerShell:

mvn archetype:generate `
  -D archetypeGroupId=org.apache.beam `
  -D archetypeArtifactId=beam-sdks-java-maven-archetypes-examples `
  -D archetypeVersion=2.60.0 `
  -D groupId=org.example `
  -D artifactId=word-count-beam `
  -D version="0.1" `
  -D package=org.apache.beam.examples `
  -D interactiveMode=false

Maven creates a new project in the word-count-beam directory.

Change into word-count-beam.

For Unix shells:
```
cd word-count-beam/
```
For Windows PowerShell:
```
cd .\word-count-beam
```
The directory contains a pom.xml and a src directory with example pipelines.
List the example pipelines.

For Unix shells:
```
ls src/main/java/org/apache/beam/examples/
```
For Windows PowerShell:
```
dir .\src\main\java\org\apache\beam\examples
```
You should see the following examples:
- DebuggingWordCount.java (GitHub)
- MinimalWordCount.java (GitHub)
- WindowedWordCount.java (GitHub)
- WordCount.java (GitHub)
The example used in this tutorial, WordCount.java, defines a Beam pipeline that counts words from an input file (by default, a .txt file containing Shakespeare’s “King Lear”). To learn more about the examples, see the WordCount Example Walkthrough.
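
To make the computation concrete, here is a minimal plain-Java sketch of what a word count does to a piece of text. It is not the Beam pipeline in WordCount.java and uses no Beam classes; the class name is hypothetical, and splitting on runs of non-letter characters is an assumption about how words are tokenized.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a word count (no Beam involved).
class WordCountSketch {
    // Splits the text on runs of non-letter characters and tallies
    // each remaining word. Counting is case-sensitive.
    static Map<String, Long> countWords(String text) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : text.split("[^\\p{L}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Renders one result in a "word: count" shape.
    static String format(String word, long count) {
        return word + ": " + count;
    }

    public static void main(String[] args) {
        countWords("Nothing will come of nothing: speak again.")
            .forEach((word, count) -> System.out.println(format(word, count)));
    }
}
```

A Beam pipeline expresses the same split-then-count logic as transforms, so a runner can distribute the work instead of looping over the text in one process.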

Optional: Convert from Maven to Gradle

The steps below explain how to convert the build from Maven to Gradle for the following runners:

Direct runner
Dataflow runner

The conversion process for other runners is similar. For additional guidance, see Migrating Builds From Apache Maven.

In the directory with the pom.xml file, run the automated Maven-to-Gradle conversion:
```
gradle init
```
You’ll be asked if you want to generate a Gradle build. Enter yes. You’ll also be prompted to choose a DSL (Groovy or Kotlin). For this tutorial, enter 2 for Kotlin.

Open the generated build.gradle.kts file and make the following changes:

In repositories, replace mavenLocal() with mavenCentral().

In repositories, declare a repository for Confluent Kafka dependencies:

maven {
    url = uri("https://packages.confluent.io/maven/")
}

At the end of the build script, add the following conditional dependency:

if (project.hasProperty("dataflow-runner")) {
    dependencies {
        runtimeOnly("org.apache.beam:beam-runners-google-cloud-dataflow-java:2.60.0")
    }
}

At the end of the build script, add the following task:

tasks.register<JavaExec>("execute") {
  mainClass.set(System.getProperty("mainClass"))
  classpath = sourceSets.main.get().runtimeClasspath
}

Build your project:
```
gradle build
```

Get sample text

If you’re planning to use the DataflowRunner, you can skip this step. The runner will pull text directly from Google Cloud Storage.

In the word-count-beam directory, create a file called sample.txt.
Add some text to the file. For this example, use the text of Shakespeare’s King Lear.

Run a pipeline

A single Beam pipeline can run on multiple Beam runners. The DirectRunner is useful for getting started, because it runs on your machine and requires no specific setup. If you’re just trying out Beam and you’re not sure what to use, use the DirectRunner.

The general process for running a pipeline goes like this:

Complete any runner-specific setup.
Build your command line:
1. Specify a runner with --runner=<runner> (defaults to the DirectRunner).
2. Add any runner-specific required options.
3. Choose input files and an output location that are accessible to the runner. (For example, you can’t access a local file if you are running the pipeline on an external cluster.)
Run the command.
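
The command-line options in the steps above all follow a --name=value shape. In the Beam examples, parsing is handled by Beam's PipelineOptionsFactory; the sketch below is only a hypothetical, simplified stand-in written in plain Java to show the idea, including the DirectRunner default when no --runner is given.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration of --name=value parsing (not Beam's implementation).
class PipelineArgsSketch {
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        // Default named in this quickstart when --runner is omitted.
        options.put("runner", "DirectRunner");
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (!arg.startsWith("--") || eq < 0) {
                throw new IllegalArgumentException("Expected --name=value, got: " + arg);
            }
            // Everything after the first '=' is the value, so values may contain '='.
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return options;
    }

    public static void main(String[] args) {
        parse(args).forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```

Passing --inputFile=sample.txt --output=counts to this sketch yields three entries: the defaulted runner plus the two options supplied.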

To run the WordCount pipeline:

Follow the setup steps for your runner:
The DirectRunner will work without additional setup.
Run the corresponding Maven or Gradle command below.

Run WordCount using Maven

For Unix shells:

Direct runner:

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--inputFile=sample.txt --output=counts" -Pdirect-runner

Flink runner (local):

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--runner=FlinkRunner --inputFile=sample.txt --output=counts" -Pflink-runner

Flink runner (cluster):

mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--runner=FlinkRunner --flinkMaster=<flink master> --filesToStage=target/word-count-beam-bundled-0.1.jar \
                 --inputFile=sample.txt --output=/tmp/counts" -Pflink-runner

Spark runner:

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--runner=SparkRunner --inputFile=sample.txt --output=counts" -Pspark-runner

Dataflow runner:

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
                 --region=<your-gcp-region> \
                 --gcpTempLocation=gs://<your-gcs-bucket>/tmp \
                 --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
    -Pdataflow-runner

Samza runner:

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--inputFile=sample.txt --output=/tmp/counts --runner=SamzaRunner" -Psamza-runner

Nemo runner:

mvn package -Pnemo-runner && java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount \
    --runner=NemoRunner --inputFile=`pwd`/sample.txt --output=counts

Jet runner:

mvn package -Pjet-runner
java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount \
    --runner=JetRunner --jetLocalMode=3 --inputFile=`pwd`/sample.txt --output=counts

For Windows PowerShell:

Direct runner:

mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--inputFile=sample.txt --output=counts" -P direct-runner

Flink runner (local):

mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--runner=FlinkRunner --inputFile=sample.txt --output=counts" -P flink-runner

Flink runner (cluster):

mvn package exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--runner=FlinkRunner --flinkMaster=<flink master> --filesToStage=.\target\word-count-beam-bundled-0.1.jar `
               --inputFile=C:\path\to\quickstart\sample.txt --output=C:\tmp\counts" -P flink-runner

Spark runner:

mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--runner=SparkRunner --inputFile=sample.txt --output=counts" -P spark-runner

Dataflow runner:

mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--runner=DataflowRunner --project=<your-gcp-project> `
               --region=<your-gcp-region> `
               --gcpTempLocation=gs://<your-gcs-bucket>/tmp `
               --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" `
 -P dataflow-runner

Samza runner:

mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount `
 -D exec.args="--inputFile=sample.txt --output=/tmp/counts --runner=SamzaRunner" -P samza-runner

Nemo runner:

mvn package -P nemo-runner -DskipTests
java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount `
    --runner=NemoRunner --inputFile=$pwd/sample.txt --output=counts

Jet runner:

mvn package -P jet-runner
java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount `
    --runner=JetRunner --jetLocalMode=3 --inputFile=$pwd/sample.txt --output=counts

Run WordCount using Gradle

For Unix shells:

Direct runner:

gradle clean execute -DmainClass=org.apache.beam.examples.WordCount \
    --args="--inputFile=sample.txt --output=counts"

TODO: document Flink on Gradle: https://github.com/apache/beam/issues/21498

TODO: document FlinkCluster on Gradle: https://github.com/apache/beam/issues/21499

TODO: document Spark on Gradle: https://github.com/apache/beam/issues/21502

Dataflow runner:

gradle clean execute -DmainClass=org.apache.beam.examples.WordCount \
    --args="--project=<your-gcp-project> --inputFile=gs://apache-beam-samples/shakespeare/* \
    --output=gs://<your-gcs-bucket>/counts --runner=DataflowRunner" -Pdataflow-runner

TODO: document Samza on Gradle: https://github.com/apache/beam/issues/21500

TODO: document Nemo on Gradle: https://github.com/apache/beam/issues/21503

TODO: document Jet on Gradle: https://github.com/apache/beam/issues/21501

Inspect the results

After the pipeline completes, you can view the output. There might be multiple output files, each prefixed with counts. The runner decides how many output files to write, which gives it the flexibility to execute the pipeline efficiently across distributed workers.
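
When a runner writes several output files, their names typically carry a shard index and a shard total after the prefix. The sketch below generates names in that shape, assuming the -SSSSS-of-NNNNN shard template that Beam uses by default for unwindowed text output; the class and method names are hypothetical, and your runner or options may produce different names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: builds file names in the "-SSSSS-of-NNNNN" shard shape.
class ShardNamesSketch {
    // Formats one shard name, zero-padding both numbers to five digits.
    static String shardName(String prefix, int shard, int numShards) {
        return String.format(Locale.ROOT, "%s-%05d-of-%05d", prefix, shard, numShards);
    }

    // Lists the names of all shards, from 0 to numShards - 1.
    static List<String> allShardNames(String prefix, int numShards) {
        List<String> names = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            names.add(shardName(prefix, shard, numShards));
        }
        return names;
    }

    public static void main(String[] args) {
        allShardNames("counts", 3).forEach(System.out::println);
    }
}
```

This naming is why the commands in this section use a wildcard after the prefix instead of a single file name.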

View the output files in a Unix shell.

Direct, Flink (local), Spark, Nemo, and Jet runners:

ls counts*

Flink (cluster) and Samza runners:

ls /tmp/counts*

Dataflow runner:

gsutil ls gs://<your-gcs-bucket>/counts*

The output files contain unique words and the number of occurrences of each word.

View the output content in a Unix shell.

Direct, Flink (local), Spark, Nemo, and Jet runners:

more counts*

Flink (cluster) and Samza runners:

more /tmp/counts*

Dataflow runner:

gsutil cat gs://<your-gcs-bucket>/counts*

The order of elements is not guaranteed, which lets runners optimize for efficiency, but the output should look something like this:

...
Think: 3
slower: 1
Having: 1
revives: 1
these: 33
wipe: 1
arrives: 1
concluded: 1
begins: 3
...
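
Because the results can be spread over several files in no particular order, it can help to merge and rank them. The following plain-Java sketch parses lines in the word: count shape shown above and returns the most frequent words. The class and method names are hypothetical, and it assumes every non-blank line follows that shape with a numeric count.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative helper (not part of Beam): merges "word: count" lines.
class CountsReaderSketch {
    // Parses "word: count" lines; lines without ": " are skipped.
    static Map<String, Long> parseLines(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            int sep = line.lastIndexOf(": ");
            if (sep < 0) {
                continue;
            }
            long count = Long.parseLong(line.substring(sep + 2).trim());
            counts.merge(line.substring(0, sep), count, Long::sum);
        }
        return counts;
    }

    // Returns up to n entries, highest count first, ties broken alphabetically.
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    // Reads every file named on the command line and prints the ten most frequent words.
    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String path : args) {
            lines.addAll(Files.readAllLines(Paths.get(path)));
        }
        for (Map.Entry<String, Long> entry : topN(parseLines(lines), 10)) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

After compiling, pass the local output files as arguments, for example by letting the shell expand counts*.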

Next steps

Learn more about the Beam SDK for Java and look through the Java SDK API reference.
Walk through the WordCount examples in the WordCount Example Walkthrough.
Take a self-paced tour through our Learning Resources.
Dive into some of our favorite Videos and Podcasts.
Join the Beam users@ mailing list.

Please don’t hesitate to reach out if you encounter any issues!

Last updated on 2024/11/25
