emredogan7/flickr-metadata-processing-using-spark


flickr-metadata-processing

A simple Apache Spark application to process metadata of images taken from Flickr.

CENG790 Big Data Analytics Course Assignment #1.
Assignment details can be found here.

The assignment consists of 2 parts:

  • Processing data using Spark DataFrame API
  • Processing data using RDDs

Note that this project doubles as an introductory Spark tutorial in Scala and covers basic data processing applications.

Part 1: Processing data using the DataFrame API

In this part, Flickr metadata is processed with Spark DataFrames. A DataFrame is a column-based data structure with a schema. DataFrames are popular because they support SQL queries.
The main purpose of this part is to obtain the images that include valid (non-null) GPS information and have a NonDerivative license type.
Source code of this part can be accessed from here.
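
The filtering described above can be sketched as follows. This is only an illustration, not the repository's actual code: the column names (`photoId`, `latitude`, `longitude`, `license`, `licenseName`, `nonDerivative`) and the sample rows are assumptions made up for the example, and the real metadata schema may differ.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object GpsLicenseFilterSketch {

  // Keep pictures that have non-null coordinates and whose license is
  // flagged as NonDerivative in the (hypothetical) license lookup table.
  def filterPictures(pictures: DataFrame, licenses: DataFrame): DataFrame = {
    val nonDerivative = licenses
      .filter(col("nonDerivative") === 1)
      .select("licenseName")

    pictures
      .filter(col("latitude").isNotNull && col("longitude").isNotNull)
      .join(nonDerivative, col("license") === col("licenseName"))
      .select("photoId", "latitude", "longitude", "license")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gps-license-filter-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows; None becomes a null GPS value in the DataFrame.
    val pictures = Seq(
      (1L, Some(48.85), Some(2.35), "Attribution-NoDerivs License"),
      (2L, None, Some(32.85), "Attribution-NoDerivs License"),
      (3L, Some(39.93), Some(32.85), "Attribution License")
    ).toDF("photoId", "latitude", "longitude", "license")

    val licenses = Seq(
      ("Attribution-NoDerivs License", 1),
      ("Attribution License", 0)
    ).toDF("licenseName", "nonDerivative")

    // Only photoId 1 has both valid GPS data and a NonDerivative license.
    filterPictures(pictures, licenses).show()

    spark.stop()
  }
}
```

The same result could also be expressed as a SQL query after registering both DataFrames as temporary views with `createOrReplaceTempView`.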

Part 2: Processing data using RDDs (Resilient Distributed Datasets)

An RDD is a logical reference to a dataset partitioned across many different machines. Data stored in an RDD is immutable. For more information on RDDs, check out the paper.
In this part, the metadata of the images is represented as an RDD of objects. For each image, a Picture object (see Picture.scala) is created from its metadata, and all Picture objects are kept in an RDD.

Then, all images are grouped by location (the country in which each picture was taken). Finally, the user tags of the images and the frequency of these tags on Flickr are collected for each country.
Source code of this part can be accessed from here.
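
The grouping and counting steps described above can be sketched as follows. This is a minimal illustration under assumptions, not the repository's implementation: `PictureSketch` is a hypothetical, simplified stand-in for the `Picture` class with only a country code and a list of user tags, and the sample data is made up.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the repository's Picture class.
case class PictureSketch(country: String, userTags: Seq[String])

object TagsPerCountrySketch {

  // Group pictures by country, then count how often each tag occurs
  // among the pictures of that country.
  def tagFrequencies(pictures: RDD[PictureSketch]): RDD[(String, Map[String, Int])] =
    pictures
      .groupBy(_.country)
      .mapValues { pics =>
        pics
          .flatMap(_.userTags)
          .groupBy(identity)
          .map { case (tag, occurrences) => tag -> occurrences.size }
      }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tags-per-country-sketch")
      .master("local[*]")
      .getOrCreate()

    // Made-up sample pictures.
    val pictures = spark.sparkContext.parallelize(Seq(
      PictureSketch("FR", Seq("paris", "tower")),
      PictureSketch("FR", Seq("paris")),
      PictureSketch("TR", Seq("ankara"))
    ))

    // For FR this yields paris -> 2 and tower -> 1; for TR, ankara -> 1.
    tagFrequencies(pictures).collect().foreach(println)

    spark.stop()
  }
}
```

The sketch uses `groupBy` because it mirrors the description (group first, then count). On large inputs, counting `((country, tag), 1)` pairs with `reduceByKey` would likely shuffle less data, since partial sums are combined before the shuffle.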

A more comprehensive technical report can be accessed from here.
