Commit

doc: Document Updates (#422)
* doc: fix grammar error

* doc: remove memcached private workload

* fix: CI error

* fix: add missing word

* doc: remove unexecutable data

* doc: remove redundant content

* fix(graph-analytics): memory requirement for remote node.

* made the opening paragraph a bit easier to follow

* grammatical changes for the rest of the document

* Update data-analytics.md

* Update data-caching.md

* Update data-analytics.md

* Update data-serving.md

* Update graph-analytics.md

* Update in-memory-analytics.md

* Update media-streaming.md

* Update web-search.md

* Update web-search.md

* Update web-serving.md

* doc: different log

* doc(web-serving): Database becomes interactive to see download speed

* doc: rephrase words

* doc: consistently use `bash` to highlight code block

* doc: explanation of HDFS

* doc: less ambiguous word

* Update data-analytics.md

* doc: grammatical changes to data-caching

* doc: update typo

* doc: fix typos and consistency

* doc: consistent wording

* doc: improve data caching by consistency

* doc: 99 percentile latency

* doc: web-serving document grammar fix

* refactor: rename the interval distribution for web search

* doc: finalize web search

* docs: finalize data serving

* doc: finalize graph analytics workload

* doc: finalize media streaming

* doc: finalize in-memory analytics.

* doc: add dockerfile list

* doc: typo

* doc: update license

* doc: wording

* fix: remove additional QoS

* doc: fix spell checking

* doc: add flag

* doc: missing article

---------

Co-authored-by: Ayan Chakraborty <[email protected]>
xusine and ayanchak1508 authored Mar 26, 2023
1 parent 4e9abd7 commit 3303951
Showing 13 changed files with 363 additions and 423 deletions.
9 changes: 9 additions & 0 deletions .wordlist.txt
@@ -36,15 +36,18 @@ DockerHub
dP
dt
Elgg
entrypoint
epfl
ethernet
faban
FastCGI
fb
filesystem
fpm
frontend
Gbit
GBs
GC
github
GraphX
grouplens
@@ -61,6 +64,7 @@ IPAddress
JVM
JIT
keyspace
LLC
localhost
login
MapReduce
@@ -74,6 +78,7 @@ MemcacheServer
memcacheserverdocker
metadata
middleware
microarchitectures
MLlib
Moby
Movielens
@@ -101,11 +106,13 @@ qemu
QEMU
QoS
README
realtime
Recommender
RECORDCOUNT
repo
runtime
scalability
Skylake
slavedocker
solr
solr's
@@ -121,9 +128,11 @@ UI
usertable
usr
uwsgi
vectorization
videoperf
VM
VMs
warmup
WebServer
webserverdocker
webserverdocker
7 changes: 3 additions & 4 deletions LICENSE.md
@@ -3,7 +3,6 @@ CloudSuite consists of several software components that are governed by various
### Software developed externally (not by the CloudSuite group)

* [Nginx Web Server](http://nginx.org/LICENSE)
* [MySQL DBMS](http://www.gnu.org/licenses/gpl.html)
* [PHP](http://www.php.net/license/3_01.txt)
* [APC (Alternative PHP Cache)](http://www.php.net/license/3_01.txt)
* [Nutch](http://www.apache.org/licenses/LICENSE-2.0)
@@ -24,12 +23,12 @@ CloudSuite consists of several software components that are governed by various
* [Elgg](https://www.gnu.org/licenses/gpl-2.0.html)

### Software developed internally (by the CloudSuite group)
**CloudSuite 3.0 License**
**CloudSuite 4.0 License**


CloudSuite 3.0 Benchmark Suite
CloudSuite 4.0 Benchmark Suite

Copyright &copy; 2011-2018, Parallel Systems Architecture Lab, EPFL
Copyright &copy; 2011-2023, Parallel Systems Architecture Lab, EPFL
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
8 changes: 0 additions & 8 deletions benchmarks/data-caching/client/README.md

This file was deleted.

8 changes: 0 additions & 8 deletions benchmarks/data-caching/server/README.md

This file was deleted.

6 changes: 3 additions & 3 deletions benchmarks/web-search/client/docker-entrypoint.py
@@ -12,7 +12,7 @@
arg.add_argument("--interval-max", type=int, help="The maximum interval for request generation, in milliseconds", default=1500)
arg.add_argument("--interval-deviation", type=float, help="The deviation of the interval, in percentage.", default=0)
arg.add_argument("--interval-type", choices=["ThinkTime", "CycleTime"], help="The interval type.", default="ThinkTime")
arg.add_argument("--interval-distribution", choices=["Fixed", "Uniform", "NegativeExponential"], help="The distribution of interval", default="Fixed")
arg.add_argument("--interval-distribution", choices=["Fixed", "Uniform", "NegExp"], help="The distribution of interval", default="Fixed")

arg.add_argument("--dataset-distribution", choices=["Random", "Zipfian"], help="The distribution of the request", default="Zipfian")
arg.add_argument("--output-query-result", "-q", action="store_true", help="Whether let Faban output search query. Can be a potential performance bottleneck.")
@@ -51,15 +51,15 @@
arg.interval_deviation
))
if arg.interval_min != arg.interval_max:
print("Warning: the maximal interval should be same as the minimal interval when fixed distribution is used. The program uses minimal interval as the fixed interval.")
print("Warning: the maximum interval should be same as the minimum interval when fixed distribution is used. The program uses minimum interval as the fixed interval.")
elif arg.interval_distribution == "Uniform":
f.write("@Uniform(cycleMin = {}, cycleMax = {}, cycleType = CycleType.{}, cycleDeviation = {})\n".format(
arg.interval_min,
arg.interval_max,
arg.interval_type.upper(),
arg.interval_deviation
))
elif arg.interval_distribution == "NegativeExponential":
elif arg.interval_distribution == "NegExp":
f.write("@NegativeExponential(cycleMin = {}, cycleMax = {}, cycleMean = {}, cycleType = CycleType.{}, cycleDeviation = {})\n".format(
arg.interval_min,
arg.interval_max,
42 changes: 21 additions & 21 deletions docs/benchmarks/data-analytics.md
@@ -3,41 +3,37 @@
[![Pulls on DockerHub][dhpulls]][dhrepo]
[![Stars on DockerHub][dhstars]][dhrepo]

The explosion of accessible human-generated information necessitates automated analytical processing to cluster, classify, and filter this information. Hadoop has emerged as a popular approach to handling large-scale analysis with its distributed file system and compute capabilities, allowing it to scale to PetaBytes of data. The Data Analytics benchmark is included in CloudSuite to cover the increasing importance of classification tasks analyzing large amounts of data in datacenters using the MapReduce framework. It is composed of Mahout, a set of machine learning libraries, running on top of Hadoop, an open-source implementation of MapReduce.
The explosion of human-generated information necessitates automated analytical processing to cluster, classify, and filter this information. The Data Analytics benchmark is included in CloudSuite to cover the increasing importance of classification tasks in analyzing large amounts of data in datacenters. It uses the MapReduce framework Hadoop, which is a popular approach for handling large-scale analysis. Its distributed file system and compute capabilities allow it to scale to petabytes of data.

The benchmark consists of running a Naive Bayes classifier on a (Wikimedia dataset)[https://dumps.wikimedia.org/backup-index.html]. It uses Hadoop version 2.10.2 and Mahout version 14.1.
This workload is based on Mahout, a set of machine learning libraries running on top of Hadoop. It runs a Naive Bayes classifier on a [Wikimedia dataset](https://dumps.wikimedia.org/backup-index.html), and uses Hadoop version 2.10.2 and Mahout version 14.1.

## Images ##

To obtain the images:
## Dockerfiles

```bash
$ docker pull cloudsuite/data-analytics
$ docker pull cloudsuite/wikimedia-pages-dataset
```
Supported tags and their respective `Dockerfile` links:
- [`latest`][latestcontainer] contains the application logic.

## Running the benchmark ##

The benchmark is designed to run on a Hadoop cluster, where the single master runs the driver program, and the slaves run the mappers and reducers.
The benchmark is designed to run on a Hadoop cluster, where a single master runs the driver program, and workers run the mappers and reducers.

First, start the container for the dataset:

```bash
$ docker create --name wikimedia-dataset cloudsuite/wikimedia-pages-dataset
```

**Note**: The following commands will start the master for the cluster. To make sure that the slaves and the master can communicate with each other, the slave containers must point to the master's IP address.

Start the master with:

```bash
$ docker run -d --net host --volumes-from wikimedia-dataset --name data-master cloudsuite/data-analytics --master
```

By default, Hadoop master node is listened on the first interface accessing to network . You can overwrite the listening address by adding `--master-ip=X.X.X.X` to change the setting.
By default, the Hadoop master node listens on the first network interface. You can override the listening address by adding `--master-ip=X.X.X.X`.
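
As a minimal sketch, the master start command shown above could be combined with this flag as follows (the address is only a placeholder for your master's actual IP):

```bash
# Sketch: bind the Hadoop master to an explicit address (10.0.0.10 is a placeholder)
$ docker run -d --net host --volumes-from wikimedia-dataset --name data-master \
    cloudsuite/data-analytics --master --master-ip=10.0.0.10
```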

Start any number of Hadoop slaves with:
```
Start any number of Hadoop workers with:

```bash
$ # on VM1
$ docker run -d --net host --name data-slave01 cloudsuite/data-analytics --slave --master-ip=<IP_ADDRESS_MASTER>

@@ -46,27 +42,31 @@ $ docker run -d --net host --name data-slave02 cloudsuite/data-analytics --slave

...
```
**Note**: You should set `IP_ADDRESS_MASTER` to master's IP address.

After both master and slave are set up (you can use `docker logs` to observe if the log is still generating), run the benchmark with:
**Note**: You should set `IP_ADDRESS_MASTER` to the master's IP address and make sure that address is accessible from each worker.

After both master and worker are set up (you can use `docker logs` to observe if the log is still being updated), run the benchmark with the following command:

```bash
$ docker exec data-master benchmark
```
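
While the job runs, one way to keep an eye on progress (assuming the container name `data-master` used above) is to follow the master's log output:

```bash
# Follow the master container's log; Ctrl-C stops following without stopping the container
$ docker logs -f data-master
```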

### Configuring Hadoop parameters ###

We can configure a few parameters for Hadoop depending on requirements.
A few parameters for Hadoop can be configured depending on requirements.

Hadoop infers the number of workers with how many partitions it created with HDFS. We can increase or reduce the HDFS partition size to `N` mb with `--hdfs-block-size=N`, 128mb being the default. The current dataset included here weights 900MB, thus the default `--hdfs-block-size=128` of 128mb resulting in splits between 1 and 8 parts depending on the benchmark phase.
Hadoop infers the number of workers from how many partitions it created with HDFS (the Hadoop Distributed File System, which distributes dataset chunks to the workers). You can increase or reduce the HDFS partition size to `N` MB with `--hdfs-block-size=N`, with 128MB being the default. The default dataset weighs 900MB. Thus, depending on the benchmark phase (sequencing, vectorization, pre-training, training, and inference), the default option `--hdfs-block-size=128` results in a split between 1 and 8 parts.

The maximum number of workers is configured by `--yarn-cores=C`, default is 8, if there's more splits than number of workers, YARN will only allow up to `C` workers threads to process them and multiplex the tasks. Please note that **at least 2 cores** should be given for all workers in total: One core for the map operation and another core for the reduce operation. Otherwise, the process can get stuck.
Hadoop relies on [YARN][yarn] (Yet Another Resource Negotiator) to manage its resources, and the maximum number of workers is configured by `--yarn-cores=C`, whose default value is 8. If there are more blocks than the number of workers, YARN will only allow up to `C` worker threads to process them. Please note that **at least two cores** should be given in total: One core for the map operation and another for the reduce operation. Otherwise, the process can get stuck.

The maximum memory used by each worker is configured by `--mapreduce-mem=N`, default is 2096mb. Note that depending on the number of `--yarn-cores=C`, the total actual physical memory required will be of at least `C*N`. You are recommended to allocate 8GB memory (even for single worker with 2 CPUs) in total to avoid out of memory errors.
The maximum memory used by each worker is configured by `--mapreduce-mem=N`, and the default value is 2096MB. Note that depending on the number of `--yarn-cores=C`, the total physical memory required will be at least `C*N`. To avoid out-of-memory errors, we recommend allocating at least 8GB of memory (even for a single worker with two cores) in total.

For increasing total number of workers, please use a bigger dataset from wikimedia. Using a smaller partition sizes than 128 mb will result in increasing number of workers but also will actually slowdown the execution due to overheads of small partition size.
To increase the number of workers, please use a bigger dataset from Wikimedia. Using partition sizes smaller than 128MB can increase the number of workers but slow down the execution due to overheads of the small partition size.
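
As an illustrative sketch only (assuming these flags are passed to the same container entrypoint as `--master` and `--master-ip`), a master sized for four worker threads of 2GB each could be started as:

```bash
# Sketch: 4 YARN cores x 2048MB per worker => budget at least 8GB of host memory
$ docker run -d --net host --volumes-from wikimedia-dataset --name data-master \
    cloudsuite/data-analytics --master \
    --hdfs-block-size=128 --yarn-cores=4 --mapreduce-mem=2048
```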


[dhrepo]: https://hub.docker.com/r/cloudsuite/data-analytics/ "DockerHub Page"
[dhpulls]: https://img.shields.io/docker/pulls/cloudsuite/data-analytics.svg "Go to DockerHub Page"
[dhstars]: https://img.shields.io/docker/stars/cloudsuite/data-analytics.svg "Go to DockerHub Page"
[yarn]: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html "YARN explanation"

[latestcontainer]: https://github.com/parsa-epfl/cloudsuite/blob/main/benchmarks/data-analytics/latest/Dockerfile "link to container, github"
