2AMD15 - Big Data Management: Discovery of Combinations of Vectors with Low Aggregate Variance

License: GPL v3

Prerequisites

Set up your Spark service before submitting code to a Spark instance. The ODC server for this project is provided by the 2AMD15 course.

Project Description

Dataset: The dataset in this assignment consists of 250 or 1000 vectors generated by a predefined Java generator program. Each vector has 10,000 elements.
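
Throughout, the aggregate variance of a triple <X,Y,Z> is taken to be the (population) variance of the element-wise sum of the three vectors; for vectors of length $n = 10000$:

$$
\mathrm{var}(X+Y+Z) \;=\; \frac{1}{n}\sum_{i=1}^{n}(x_i+y_i+z_i)^2 \;-\; \left(\frac{1}{n}\sum_{i=1}^{n}(x_i+y_i+z_i)\right)^{2}.
$$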

Q1: Write code in Spark to import the datasets for the following questions, (a) as a DataFrame/Dataset and (b) as an RDD. Include the full code in the submission.
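
For illustration, a minimal PySpark sketch of both import paths. It assumes the generator writes a headerless CSV file `vectors.csv` with one vector per line (an id followed by 10,000 integers); the file name and format are assumptions, not part of the assignment:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("2AMD15-Q1").getOrCreate()

# (a) Import as a DataFrame (schema inferred from the CSV fields).
df = spark.read.csv("vectors.csv", header=False, inferSchema=True)

# (b) Import as an RDD of (id, list-of-int) pairs.
def parse(line):
    fields = line.split(",")
    return fields[0], [int(v) for v in fields[1:]]

rdd = spark.sparkContext.textFile("vectors.csv").map(parse)
```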

Q2: Generate a dataset of 250 vectors. Write code with SparkSQL to find all triples of vectors <X,Y,Z> from the dataset with aggregate variance at most $\tau$. You may use user-defined (aggregate) functions if you want, but this is not required; alternative approaches might be more efficient. Report/deliver the following information/code (an example formulation is sketched after this list):

  • The SQL query (you can keep the parameter $\tau$ symbolic in the SQL) [Report and poster].
  • For $\tau \in \{20, 50, 310, 360, 410\}$, a plot showing the number of results (you do not need to include the actual triples in your report) and the execution time on the provided cluster [Report and poster].
  • A description of how you analyzed the performance of the SQL, the optimization methods you tried to improve it, and their impact on performance [Description in the report, summary in the poster].
  • Include the full code in the submission.
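
One possible formulation, given here as a hedged sketch rather than the submitted query: explode the vectors into a long table `vectors(id, pos, val)` with one row per vector element, self-join on position, and let Spark's built-in `var_pop` aggregate compute the variance of the element-wise sums. The view name and table layout are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("2AMD15-Q2").getOrCreate()
tau = 410  # threshold; kept as a parameter, as the assignment allows

# Assumes a temp view `vectors(id, pos, val)` has been registered,
# e.g. by exploding the Q1 DataFrame into one row per vector element.
triples = spark.sql(f"""
    SELECT a.id AS x, b.id AS y, c.id AS z
    FROM vectors a
    JOIN vectors b ON b.pos = a.pos AND b.id > a.id
    JOIN vectors c ON c.pos = b.pos AND c.id > b.id
    GROUP BY a.id, b.id, c.id
    HAVING var_pop(a.val + b.val + c.val) <= {tau}
""")
triples.show()
```

The triple self-join on `pos` dominates the cost, so caching the exploded view and pruning candidate pairs early are natural candidates for the optimization analysis requested above.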

Q3: Generate a dataset of 1000 vectors. For $\tau \in \{20, 410\}$, find all triples of labels of the vectors <X,Y,Z> with aggregate variance at most $\tau$, without SparkSQL. Report/deliver the following information/code (an example RDD-based sketch follows this list):

  • A brief description of your system architecture. Present your architecture concisely using a diagram/figure that includes the data/processing flow and the series of transformations/actions applied to the input data. The functionality of each transformation/action can be explained with a brief text below the figure. We have used such descriptions in class; a short example is also given in Appendix A [Report, and summary in the poster].
  • The optimization tricks you used and their impact on performance (i.e., execution time) [Report, and summary in the poster].
  • Also run the code of Q3 on the dataset generated for Q2 (the 250 vectors). Are the results of Q3 identical to the results of Q2 for the same $\tau$ value? If not, explain. Also report the execution time of the code on the provided cluster [Report and poster].
  • A discussion (at most 1/2 page) to compare and explain the difference in performance between Q2 and Q3. Be succinct in your explanation [Report].
  • Include the full code in the submission.
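
For illustration, a minimal RDD-only sketch (not the submitted solution; see the report for that). It broadcasts the full dataset (1000 vectors of 10,000 doubles, roughly 80 MB), enumerates triples in parallel, and computes the aggregate variance directly as var(X + Y + Z). The input format is the same assumption as in the Q1 sketch. An obvious optimization, omitted here for brevity, is to precompute per-vector variances and pairwise covariances and use var(X+Y+Z) = var(X) + var(Y) + var(Z) + 2[cov(X,Y) + cov(X,Z) + cov(Y,Z)]:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("2AMD15-Q3").getOrCreate()
sc = spark.sparkContext
tau = 410

# Hypothetical input format, as in the Q1 sketch above.
vectors = sc.textFile("vectors.csv").map(
    lambda line: (line.split(",")[0],
                  np.array(line.split(",")[1:], dtype=np.float64)))
data = dict(vectors.collect())   # small enough to hold on the driver
bcast = sc.broadcast(data)       # ship all vectors to every worker once
ids = sorted(data)
n = len(ids)

def agg_var(t):
    v = bcast.value
    return float(np.var(v[t[0]] + v[t[1]] + v[t[2]]))

# Enumerate ordered triples i < j < k in parallel and filter.
result = (sc.parallelize(range(n))
          .flatMap(lambda i: ((ids[i], ids[j], ids[k])
                              for j in range(i + 1, n)
                              for k in range(j + 1, n)))
          .filter(lambda t: agg_var(t) <= tau))
print(result.count())
```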

Q4: Generate a dataset of 250 vectors. Modify your solution to Q3 (or write a solution from scratch) such that all operations inside Spark are performed on sketches, to approximate the result. Specifically, choose one of the sketches taught in this course and find a way to use this sketch to represent the original vectors, estimate the aggregate vectors, and, finally, estimate the variances inside Spark. Your solution is expected to implement the following functionalities (an illustrative estimator is sketched after this list):

  • (functionality 1) find all triples of vector ids with an aggregate variance lower than a threshold;
  • (functionality 2) find all triples of vector ids with an aggregate variance higher than a threshold.
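
As an illustration only (the choice of sketch is yours; the report documents the actual one), here is a minimal NumPy sketch of how a Count-Min (CM) summary could support both functionalities. CM tables are linear, so the table of X+Y+Z is the element-wise sum of the per-vector tables; the total sum is recovered exactly from any single row; and the usual CM self-join-size estimate (the minimum over rows of the sum of squared counters) overestimates the second moment, which yields a variance estimate. The hash functions, table sizes, and nonnegativity of the generated values are all assumptions:

```python
import numpy as np

N, D, W = 10_000, 5, 1_000   # vector length, CM depth, CM width

rng = np.random.default_rng(42)
# Illustration-only "hash functions": one random bucket map per row.
hashes = [rng.integers(0, W, size=N) for _ in range(D)]

def cm_sketch(vec):
    """Build a D x W Count-Min table for a length-N vector."""
    table = np.zeros((D, W))
    for d in range(D):
        np.add.at(table[d], hashes[d], vec)
    return table

def est_variance(table):
    """Estimate the variance of the vector summarized by `table`."""
    s = table[0].sum()                   # total sum is exact in every row
    f2 = (table ** 2).sum(axis=1).min()  # overestimate of sum of squares
    return f2 / N - (s / N) ** 2

# Linearity: the sketch of X + Y + Z is the sum of the three sketches.
x, y, z = (rng.integers(0, 100, size=N) for _ in range(3))
approx = est_variance(cm_sketch(x) + cm_sketch(y) + cm_sketch(z))
exact = float(np.var(x + y + z))
print(approx, exact)
```

Because this estimate only overestimates (for nonnegative inputs), its error is one-sided: selecting triples with an estimate at most $\tau$ produces no false positives for functionality 1, while selecting triples with an estimate above $\tau$ misses no true results for functionality 2.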

Solutions

Please refer to the report.
