Set up your Spark service before submitting code to a Spark instance. The ODC server for the project is provided by the 2AMD15 course.
Dataset: The dataset in this assignment consists of 250 or 1000 vectors, generated by a predefined Java generator program. Each vector has size 10,000.
Q1: Write code in Spark to import the datasets for the following questions (a) as a DataFrame/Dataset, and (b) as an RDD. Include the full code in the submission.
Q2: Generate a dataset of 250 vectors. Write code with SparkSQL to find all triples of vectors <X,Y,Z> from the dataset with aggregate variance at most τ. You are allowed to use user-defined (aggregate) functions if you want, but this is not required; alternative approaches might be more efficient.
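One possible shape for the Q2 query, assuming the vectors are registered as a temporary view `vectors(id, vec)` and a user-defined function `aggVar` (a hypothetical name; you would register it yourself) that computes the variance of the element-wise sum of its three arguments:

```sql
-- :tau is a placeholder for the threshold parameter;
-- the id inequalities enumerate each unordered triple exactly once
SELECT a.id, b.id, c.id
FROM vectors a
JOIN vectors b ON a.id < b.id
JOIN vectors c ON b.id < c.id
WHERE aggVar(a.vec, b.vec, c.vec) <= :tau
```

This is only a sketch: as the assignment notes, a UDF is not required, and decomposing the variance into per-vector and pairwise terms may be more efficient than calling a UDF on every triple.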
Report/deliver the following information/code:
- The SQL query (you can keep the parameter $\tau$ in the SQL line) [Report and poster].
- For $\tau \in \{20, 50, 310, 360, 410\}$, a plot showing the number of results (you do not need to include the actual triples in your report) and the execution time on the provided cluster [Report and poster].
- A description of how you analyzed the performance of the SQL query, the optimization methods you tried, and their impact on performance [Description in the report, summary in the poster].
- Include the full code in the submission.
Q3: Generate a dataset of 1000 vectors. Write code, without using SparkSQL, to find all triples of vectors <X,Y,Z> with aggregate variance at most τ. Report/deliver the following information/code:
- A brief description of your system architecture. Present your architecture in a concise way using a diagram/figure that includes the data/processing flow and the series of transformations/actions applied on the input data. The functionality of each transformation/action can be explained with a brief text below the figure. We will use/have used such descriptions in the class. You can also find a short example in Appendix A [Report, and summary in the poster].
- Optimization tricks that you used, and their impact on performance (i.e., execution time) [Report, and summary in the poster].
- Also run the code of Q3 on the dataset generated for Q2 (the 250 vectors). Are the results of Q3 identical to the results of Q2 for the same τ value? If not, explain why. Also report the execution time of the code on the provided cluster [Report and poster].
- A discussion (at most half a page) comparing and explaining the difference in performance between Q2 and Q3. Be succinct in your explanation [Report].
- Include the full code in the submission.
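For reference, the quantity computed in Q2 and Q3 can be sketched in plain Python. This is a brute-force baseline, not a Spark implementation, and the helper names are illustrative:

```python
from itertools import combinations

def variance(v):
    """Population variance of a vector: E[v^2] - (E[v])^2."""
    n = len(v)
    mean = sum(v) / n
    return sum(x * x for x in v) / n - mean * mean

def aggregate_variance(x, y, z):
    """Variance of the element-wise sum X+Y+Z."""
    return variance([a + b + c for a, b, c in zip(x, y, z)])

def triples_below(vectors, tau):
    """All id-triples (each unordered triple once) with aggregate variance <= tau."""
    return [(i, j, k)
            for (i, x), (j, y), (k, z) in combinations(sorted(vectors.items()), 3)
            if aggregate_variance(x, y, z) <= tau]
```

A distributed Q3 solution would replace the `combinations` loop with Spark transformations (e.g., joins or a cartesian product over an RDD of (id, vector) pairs), but the per-triple computation is the same.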
Q4: Generate a dataset of 250 vectors. Modify your solution of Q3 (or write a solution from scratch) such that all operations inside Spark are performed on sketches, to approximate the result. Specifically, choose one of the sketches taught in this course and find a way to use it for representing the original vectors, estimating the aggregate vectors, and finally estimating the variances inside Spark. Your solution is expected to implement the following functionalities:
- (functionality 1) find all triples of vector ids with an aggregate variance lower than a threshold;
- (functionality 2) find all triples of vector ids with an aggregate variance higher than a threshold.
Please refer to the report.
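As an illustration of what a Q4 solution could build on, below is a minimal pure-Python sketch of the AMS (tug-of-war) technique, one possible choice among the sketches taught in the course. All names are illustrative, the sign hash is a simplification of the 4-wise independent hashing the formal analysis assumes, and the Spark wiring is omitted. The key property is linearity: the sketch of X+Y+Z is the element-wise sum of the three sketches, so aggregate variances can be estimated from per-vector sketches plus the (exact) per-vector coordinate sums.

```python
import hashlib

def sign(row, i):
    """Deterministic +/-1 hash for counter `row` and coordinate i
    (illustrative; a real implementation needs 4-wise independent hashing)."""
    h = hashlib.blake2b(f"{row}:{i}".encode(), digest_size=1).digest()[0]
    return 1 if h & 1 else -1

def ams_sketch(v, width):
    """AMS sketch: `width` counters, each a signed sum of the vector's coordinates.
    Sketching is linear: ams_sketch(x+y+z) equals the element-wise sum of the sketches."""
    return [sum(sign(r, i) * x for i, x in enumerate(v)) for r in range(width)]

def estimate_f2(sketch):
    """Estimate of the second moment sum(v_i^2): mean of the squared counters."""
    return sum(c * c for c in sketch) / len(sketch)

def estimate_variance(sk, total_sum, n):
    """Estimated variance of a length-n vector from its sketch and exact coordinate sum."""
    return estimate_f2(sk) / n - (total_sum / n) ** 2
```

Functionality 1 and 2 would then compare `estimate_variance` of each triple's summed sketch against the threshold; since the estimate is approximate, the sketch width controls how often triples near the threshold are misclassified.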