Set up your Spark service before submitting code to a Spark instance. The ODC server for the project is provided by the 2AMD15 course.
Dataset: The dataset in this assignment consists of 250 or 1000 vectors, generated by a predefined Java generator program. Each vector has size 10,000.
Q1: Write code in Spark to import the datasets for the following questions (a) as a DataFrame/Dataset, and (b) as an RDD. Include the full code in the submission.
Q2: Generate a dataset of 250 vectors. Write code with SparkSQL to find all triples of vectors <X,Y,Z> from the dataset with aggregate variance at most τ. You are allowed to use user-defined (aggregate) functions if you want, but this is not required; alternative approaches might be more efficient.
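One possible shape for the Q2 query, assuming the vectors are registered as a temporary view `vectors(id, vec)` and a user-defined function `aggVar` (a hypothetical name; you would register it yourself) that computes the variance of the element-wise sum of its three arguments:

```sql
-- :tau is a placeholder for the threshold parameter;
-- the id inequalities enumerate each unordered triple exactly once
SELECT a.id, b.id, c.id
FROM vectors a
JOIN vectors b ON a.id < b.id
JOIN vectors c ON b.id < c.id
WHERE aggVar(a.vec, b.vec, c.vec) <= :tau
```

This is only a sketch: as the assignment notes, a UDF is not required, and decomposing the variance into per-vector and pairwise terms may be more efficient than calling a UDF on every triple.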
Report/deliver the following information/code:
- The SQL query (you can keep the parameter $\tau$ in the SQL line) [Report and poster].
- For $\tau \in \{20, 50, 310, 360, 410\}$, a plot showing the number of results (you do not need to include the actual triples in your report) and the execution time on the provided cluster [Report and poster].
- A description of how you analyzed the performance of the SQL query, the optimization methods you tried, and their impact on performance [Description in the report, summary in the poster].
- Include the full code in the submission.
Q3: Generate a dataset of 1000 vectors. Write code, without using SparkSQL, to find all triples of vectors <X,Y,Z> with aggregate variance at most τ. Report/deliver the following information/code:
- A brief description of your system architecture. Present your architecture in a concise way using a diagram/figure that includes the data/processing flow and the series of transformations/actions applied on the input data. The functionality of each transformation/action can be explained with a brief text below the figure. We will use/have used such descriptions in the class. You can also find a short example in Appendix A [Report, and summary in the poster].
- Optimization tricks that you used, and their impact on performance (i.e., execution time) [Report, and summary in the poster].
- Also run the code of Q3 on the dataset generated for Q2 (the 250 vectors). Are the results of Q3 identical to the results of Q2 for the same τ value? If not, explain why. Also report the execution time of the code on the provided cluster [Report and poster].
- A discussion (at most half a page) comparing and explaining the difference in performance between Q2 and Q3. Be succinct in your explanation [Report].
- Include the full code in the submission.
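For reference, the quantity computed in Q2 and Q3 can be sketched in plain Python. This is a brute-force baseline, not a Spark implementation, and the helper names are illustrative:

```python
from itertools import combinations

def variance(v):
    """Population variance of a vector: E[v^2] - (E[v])^2."""
    n = len(v)
    mean = sum(v) / n
    return sum(x * x for x in v) / n - mean * mean

def aggregate_variance(x, y, z):
    """Variance of the element-wise sum X+Y+Z."""
    return variance([a + b + c for a, b, c in zip(x, y, z)])

def triples_below(vectors, tau):
    """All id-triples (each unordered triple once) with aggregate variance <= tau."""
    return [(i, j, k)
            for (i, x), (j, y), (k, z) in combinations(sorted(vectors.items()), 3)
            if aggregate_variance(x, y, z) <= tau]
```

A distributed Q3 solution would replace the `combinations` loop with Spark transformations (e.g., joins or a cartesian product over an RDD of (id, vector) pairs), but the per-triple computation is the same.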
Q4: Generate a dataset of 250 vectors. Modify your solution of Q3 (or write a solution from scratch) such that all operations inside Spark are performed on sketches, to approximate the result. Specifically, choose one of the sketches taught in this course and find a way to use it for representing the original vectors, estimating the aggregate vectors, and finally estimating the variances inside Spark. Your solution is expected to implement the following functionalities:
- (functionality 1) find all triples of vector ids with an aggregate variance lower than a threshold;
- (functionality 2) find all triples of vector ids with an aggregate variance higher than a threshold.
Please refer to the report.
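As an illustration of what a Q4 solution could build on, below is a minimal pure-Python sketch of the AMS (tug-of-war) technique, one possible choice among the sketches taught in the course. All names are illustrative, the sign hash is a simplification of the 4-wise independent hashing the formal analysis assumes, and the Spark wiring is omitted. The key property is linearity: the sketch of X+Y+Z is the element-wise sum of the three sketches, so aggregate variances can be estimated from per-vector sketches plus the (exact) per-vector coordinate sums.

```python
import hashlib

def sign(row, i):
    """Deterministic +/-1 hash for counter `row` and coordinate i
    (illustrative; a real implementation needs 4-wise independent hashing)."""
    h = hashlib.blake2b(f"{row}:{i}".encode(), digest_size=1).digest()[0]
    return 1 if h & 1 else -1

def ams_sketch(v, width):
    """AMS sketch: `width` counters, each a signed sum of the vector's coordinates.
    Sketching is linear: ams_sketch(x+y+z) equals the element-wise sum of the sketches."""
    return [sum(sign(r, i) * x for i, x in enumerate(v)) for r in range(width)]

def estimate_f2(sketch):
    """Estimate of the second moment sum(v_i^2): mean of the squared counters."""
    return sum(c * c for c in sketch) / len(sketch)

def estimate_variance(sk, total_sum, n):
    """Estimated variance of a length-n vector from its sketch and exact coordinate sum."""
    return estimate_f2(sk) / n - (total_sum / n) ** 2
```

Functionality 1 and 2 would then compare `estimate_variance` of each triple's summed sketch against the threshold; since the estimate is approximate, the sketch width controls how often triples near the threshold are misclassified.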