batch-hive

Spring XD Batch Hive Twitter Influencers Sample

This sample takes an input file of Twitter tweets and uses the followers attribute to rank the users who tweeted. The assumption is that accounts with more followers are more influential.
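To make the ranking idea concrete, here is a minimal local sketch in shell. It is not the Hive script that the job runs; it only mirrors the stated assumption that more followers means more influence. The function name `rank_influencers` and the tab-separated input format (screen name, then follower count) are assumptions made for illustration and do not come from the sample.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: the sample itself does this ranking with Hive on the cluster.
# Assumed (hypothetical) input on stdin: one "screen_name<TAB>followers" pair per line,
# possibly with several lines per user (one per tweet).

rank_influencers() {
  local limit="${1:-10}"          # how many users to keep, 10 by default
  local tab
  tab="$(printf '\t')"

  # 1. Keep the highest follower count seen for each user.
  # 2. Sort by follower count, descending (user name breaks ties).
  # 3. Keep the first $limit rows.
  awk -F'\t' '
    NF >= 2 && (!($1 in max) || ($2 + 0) > max[$1]) { max[$1] = $2 + 0 }
    END { for (user in max) printf "%s\t%d\n", user, max[user] }
  ' | sort -t "$tab" -k2,2nr -k1,1 | head -n "$limit"
}
```

For example, `printf 'alice\t10\nbob\t300\nalice\t25\n' | rank_influencers 2` lists bob (300) first, followed by alice with her highest count (25).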

Requirements

In order for the sample to run you will need to have installed:

Spring XD 1.1.0.RELEASE or later (Instructions)
Hadoop (Instructions)

If you ran the batch-hashtag-count sample then you already have the tweet files we need for the input. If not, then you can copy the provided sample file before running this sample using the following commands in the XD Shell.

First, configure the XD Shell to use our Hadoop NameNode:

xd:>hadoop config fs --namenode hdfs://localhost:8020

Now we can copy the data file to the /xd/tweets directory.

xd:>hadoop fs mkdir /xd/tweets
xd:>hadoop fs copyFromLocal --from <path-to-this-sample>/data/tweets-0.txt --to /xd/tweets/tweets-0.txt

Note: If you are using a Hadoop distribution that uses a different configuration than the default one from Apache Hadoop, then you need to provide additional configuration settings to be used by any MapReduce tasks submitted to the cluster. See this page for details.

Building

Build the sample simply by executing:

$ mvn package

The project pom declares spring-xd-module-parent as its parent. This adds the dependencies needed to compile and test the module and also configures the Spring Boot Maven Plugin to package the module as an uber-jar, packaging any dependencies that are not already provided by the Spring XD container. In this case there are no additional dependencies so the artifact is built as a common jar. See the Modules section in the Spring XD Reference for more details on module packaging.

The module's src/main/resources/config directory contains the spring-module.xml file, which defines the location of the input and output directories for the job. You can verify the settings inside spring-module.xml. All relevant properties are defined in the util:properties element:

<util:properties id="myProperties" >
    <prop key="hive.input.path">/xd/tweets</prop>
    <prop key="hive.output.path">/xd/hiveout</prop>
    <prop key="hive.server.host">localhost</prop>
    <prop key="hive.server.port">10000</prop>
</util:properties>

If your Hadoop cluster has a HiveServer running, change the hive.server.host and hive.server.port properties to match your cluster. Note that HiveServer2 won't work, since it is accessed through JDBC; this example only works with the original Thrift-based HiveServer. You can also start a HiveServer locally; see the HiveServer documentation. To download pre-1.0 Hive releases you need to use the Apache archive.
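Before launching the job, it can help to confirm that something is actually listening on the configured host and port. The following is a rough pre-flight sketch, not part of the sample: `port_open` is a hypothetical helper, and it assumes bash (for `/dev/tcp`) and the coreutils `timeout` command are available. A successful connection only shows that the port is open; it cannot tell the original HiveServer apart from HiveServer2, which uses the same default port.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper, not part of the sample.
# Exits 0 when a TCP connection to host:port succeeds within 3 seconds.
# Relies on bash's /dev/tcp redirection and the coreutils `timeout` command.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Values taken from the hive.server.host and hive.server.port properties above.
if port_open localhost 10000; then
  echo "Something is listening on localhost:10000"
else
  echo "Nothing reachable on localhost:10000; start HiveServer or adjust the properties"
fi
```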

The same spring-module.xml file also defines the Hadoop configuration, defaulting to what is specified in the Spring XD configuration.

<hadoop:configuration>
    fs.defaultFS=${spring.hadoop.fsUri}
    yarn.resourcemanager.hostname=${spring.hadoop.resourceManagerHost}
    mapreduce.framework.name=yarn
    mapreduce.jobhistory.address=${spring.hadoop.resourceManagerHost}:10020
</hadoop:configuration>

Running the Sample

Now your sample is ready to be executed. The simplest way to run Spring XD is using the singlenode server.

xd/bin>$ ./xd-singlenode

Now start the Spring XD Shell in a separate window:

shell/bin>$ ./xd-shell

Upload the module

In the Spring XD shell:

xd:>module upload --type job --name hiveExample --file <path-to-this-sample>/target/batch-hive-1.0.0.BUILD-SNAPSHOT.jar

Create the Batch Job

Now create a new batch job using the Spring XD Shell:

xd:>job create hiveJob --definition "hiveExample" --deploy

You should see a message:

Successfully created and deployed job 'hiveJob'

Launch the job using:

xd:>job launch hiveJob

You should see a message:

Successfully submitted launch request for job 'hiveJob'

Verify the result

Check that the job is running and wait for it to complete:

xd:>job execution list
  Id  Job Name  Start Time               Step Execution Count  Execution Status  Deployment Status  Definition Status
  --  --------  -----------------------  --------------------  ----------------  -----------------  -----------------
  72  hiveJob   2015-02-04 11:21:56,778  2                     STARTED           Deployed           Exists

When the Execution Status changes to COMPLETED you can check the results.

To do that we should configure the XD Shell to use our Hadoop NameNode:

xd:>hadoop config fs --namenode hdfs://localhost:8020

We will now take a look at the results in the HDFS filesystem:

xd:>hadoop fs ls /xd/hiveout
Hadoop configuration changed, re-initializing shell...
Found 1 items
-rw-r--r--   3 trisberg supergroup        192 2015-02-04 11:32 /xd/hiveout/000000_0

Finally, executing:

xd:>hadoop fs cat /xd/hiveout/000000_0

should yield a list of 10 Twitter users and their respective follower counts.

Removing output directory

Depending on your cluster setup, the output directory might have been created by another user, such as hive. In that case you need to remove the directory as that user or with a superuser account. On a non-secured cluster this can be achieved with:

$ HADOOP_USER_NAME=hive hadoop fs -rm -r /xd/hiveout
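If you clean up repeatedly, for instance between runs, the command can be wrapped so that it only deletes when the directory is really there. This is a sketch under stated assumptions: `remove_output_dir` is a hypothetical helper name, the `hadoop` command is assumed to be on the PATH, and, like the command above, it only applies to non-secured clusters where HADOOP_USER_NAME is honored.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the command above; for non-secured clusters only.
# Usage: remove_output_dir <hdfs-dir> [hadoop-user]
remove_output_dir() {
  local dir="$1" user="${2:-hive}"

  # `hadoop fs -test -d` exits 0 only when the path exists and is a directory.
  if HADOOP_USER_NAME="$user" hadoop fs -test -d "$dir"; then
    HADOOP_USER_NAME="$user" hadoop fs -rm -r "$dir"
  else
    echo "Nothing to remove: $dir does not exist"
  fi
}

# Example call (commented out so that sourcing this file has no side effects):
# remove_output_dir /xd/hiveout hive
```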
