Commit

doc: Document Updates (#422)
* doc: fix grammar error

* doc: remove memcached private workload

* fix: CI error

* fix: add missing word

* doc: remove unexecutable data

* doc: remove redundant content

* fix(graph-analytics): memory requirement for remote node.

* made the opening paragraph a bit easier to follow

* grammatical changes for the rest of the document

* Update data-analytics.md

* Update data-caching.md

* Update data-analytics.md

* Update data-serving.md

* Update graph-analytics.md

* Update in-memory-analytics.md

* Update media-streaming.md

* Update web-search.md

* Update web-search.md

* Update web-serving.md

* doc: different log

* doc(web-serving): Database becomes interactive to see download speed

* doc: rephrase words

* doc: consistently use `bash` to highlight code block

* doc: explanation of HDFS

* doc: less ambiguous word

* Update data-analytics.md

* doc: grammatical changes to data-caching

* doc: update typo

* doc: fix typos and consistency

* doc: consistent wording

* doc: improve data caching by consistency

* doc: 99 percentile latency

* doc: web-serving document grammar fix

* refactor: rename the interval distribution for web search

* doc: finalize web search

* docs: finalize data serving

* doc: finalize graph analytics workload

* doc: finalize media streaming

* doc: finalize in-memory analytics.

* doc: add dockerfile list

* doc: typo

* doc: update license

* doc: wording

* fix: remove additional QoS

* doc: fix spell checking

* doc: add flag

* doc: missing article

---------

Co-authored-by: Ayan Chakraborty <[email protected]>
xusine and ayanchak1508 authored Mar 26, 2023
1 parent 4e9abd7 commit 3303951
Showing 13 changed files with 363 additions and 423 deletions.
9 changes: 9 additions & 0 deletions .wordlist.txt
@@ -36,15 +36,18 @@ DockerHub
dP
dt
Elgg
entrypoint
epfl
ethernet
faban
FastCGI
fb
filesystem
fpm
frontend
Gbit
GBs
GC
github
GraphX
grouplens
@@ -61,6 +64,7 @@ IPAddress
JVM
JIT
keyspace
LLC
localhost
login
MapReduce
@@ -74,6 +78,7 @@ MemcacheServer
memcacheserverdocker
metadata
middleware
microarchitectures
MLlib
Moby
Movielens
@@ -101,11 +106,13 @@ qemu
QEMU
QoS
README
realtime
Recommender
RECORDCOUNT
repo
runtime
scalability
Skylake
slavedocker
solr
solr's
@@ -121,9 +128,11 @@ UI
usertable
usr
uwsgi
vectorization
videoperf
VM
VMs
warmup
WebServer
webserverdocker
webserverdocker
7 changes: 3 additions & 4 deletions LICENSE.md
@@ -3,7 +3,6 @@ CloudSuite consists of several software components that are governed by various
### Software developed externally (not by the CloudSuite group)

* [Nginx Web Server](http://nginx.org/LICENSE)
* [MySQL DBMS](http://www.gnu.org/licenses/gpl.html)
* [PHP](http://www.php.net/license/3_01.txt)
* [APC (Alternative PHP Cache)](http://www.php.net/license/3_01.txt)
* [Nutch](http://www.apache.org/licenses/LICENSE-2.0)
@@ -24,12 +23,12 @@ CloudSuite consists of several software components that are governed by various
* [Elgg](https://www.gnu.org/licenses/gpl-2.0.html)

### Software developed internally (by the CloudSuite group)
**CloudSuite 3.0 License**
**CloudSuite 4.0 License**


CloudSuite 3.0 Benchmark Suite
CloudSuite 4.0 Benchmark Suite

Copyright &copy; 2011-2018, Parallel Systems Architecture Lab, EPFL
Copyright &copy; 2011-2023, Parallel Systems Architecture Lab, EPFL
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
8 changes: 0 additions & 8 deletions benchmarks/data-caching/client/README.md

This file was deleted.

8 changes: 0 additions & 8 deletions benchmarks/data-caching/server/README.md

This file was deleted.

6 changes: 3 additions & 3 deletions benchmarks/web-search/client/docker-entrypoint.py
@@ -12,7 +12,7 @@
arg.add_argument("--interval-max", type=int, help="The maximum interval for request generation, in milliseconds", default=1500)
arg.add_argument("--interval-deviation", type=float, help="The deviation of the interval, in percentage.", default=0)
arg.add_argument("--interval-type", choices=["ThinkTime", "CycleTime"], help="The interval type.", default="ThinkTime")
arg.add_argument("--interval-distribution", choices=["Fixed", "Uniform", "NegativeExponential"], help="The distribution of interval", default="Fixed")
arg.add_argument("--interval-distribution", choices=["Fixed", "Uniform", "NegExp"], help="The distribution of interval", default="Fixed")

arg.add_argument("--dataset-distribution", choices=["Random", "Zipfian"], help="The distribution of the request", default="Zipfian")
arg.add_argument("--output-query-result", "-q", action="store_true", help="Whether let Faban output search query. Can be a potential performance bottleneck.")
@@ -51,15 +51,15 @@
arg.interval_deviation
))
if arg.interval_min != arg.interval_max:
print("Warning: the maximal interval should be same as the minimal interval when fixed distribution is used. The program uses minimal interval as the fixed interval.")
print("Warning: the maximum interval should be same as the minimum interval when fixed distribution is used. The program uses minimum interval as the fixed interval.")
elif arg.interval_distribution == "Uniform":
f.write("@Uniform(cycleMin = {}, cycleMax = {}, cycleType = CycleType.{}, cycleDeviation = {})\n".format(
arg.interval_min,
arg.interval_max,
arg.interval_type.upper(),
arg.interval_deviation
))
elif arg.interval_distribution == "NegativeExponential":
elif arg.interval_distribution == "NegExp":
f.write("@NegativeExponential(cycleMin = {}, cycleMax = {}, cycleMean = {}, cycleType = CycleType.{}, cycleDeviation = {})\n".format(
arg.interval_min,
arg.interval_max,
42 changes: 21 additions & 21 deletions docs/benchmarks/data-analytics.md
@@ -3,41 +3,37 @@
[![Pulls on DockerHub][dhpulls]][dhrepo]
[![Stars on DockerHub][dhstars]][dhrepo]

The explosion of accessible human-generated information necessitates automated analytical processing to cluster, classify, and filter this information. Hadoop has emerged as a popular approach to handling large-scale analysis with its distributed file system and compute capabilities, allowing it to scale to PetaBytes of data. The Data Analytics benchmark is included in CloudSuite to cover the increasing importance of classification tasks analyzing large amounts of data in datacenters using the MapReduce framework. It is composed of Mahout, a set of machine learning libraries, running on top of Hadoop, an open-source implementation of MapReduce.
The explosion of human-generated information necessitates automated analytical processing to cluster, classify, and filter this information. The Data Analytics benchmark is included in CloudSuite to cover the increasing importance of classification tasks in analyzing large amounts of data in datacenters. It uses the MapReduce framework Hadoop, which is a popular approach for handling large-scale analysis. Its distributed file system and compute capabilities allow it to scale to petabytes of data.

The benchmark consists of running a Naive Bayes classifier on a (Wikimedia dataset)[https://dumps.wikimedia.org/backup-index.html]. It uses Hadoop version 2.10.2 and Mahout version 14.1.
This workload is based on Mahout, a set of machine learning libraries running on top of Hadoop. It runs a Naive Bayes classifier on a [Wikimedia dataset](https://dumps.wikimedia.org/backup-index.html), and uses Hadoop version 2.10.2 and Mahout version 14.1.

## Images ##

To obtain the images:
## Dockerfiles

```bash
$ docker pull cloudsuite/data-analytics
$ docker pull cloudsuite/wikimedia-pages-dataset
```
Supported tags and their respective `Dockerfile` links:
- [`latest`][latestcontainer] contains the application logic.

## Running the benchmark ##

The benchmark is designed to run on a Hadoop cluster, where the single master runs the driver program, and the slaves run the mappers and reducers.
The benchmark is designed to run on a Hadoop cluster, where a single master runs the driver program, and workers run the mappers and reducers.

First, start the container for the dataset:

```bash
$ docker create --name wikimedia-dataset cloudsuite/wikimedia-pages-dataset
```

**Note**: The following commands will start the master for the cluster. To make sure that the slaves and the master can communicate with each other, the slave containers must point to the master's IP address.

Start the master with:

```bash
$ docker run -d --net host --volumes-from wikimedia-dataset --name data-master cloudsuite/data-analytics --master
```

By default, Hadoop master node is listened on the first interface accessing to network . You can overwrite the listening address by adding `--master-ip=X.X.X.X` to change the setting.
By default, the Hadoop master node listens on the first network interface. You can override the listening address by adding `--master-ip=X.X.X.X`.
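
As a minimal sketch, the master start command shown above could be combined with this flag as follows (the address is only a placeholder for your master's actual IP):

```bash
# Sketch: bind the Hadoop master to an explicit address (10.0.0.10 is a placeholder)
$ docker run -d --net host --volumes-from wikimedia-dataset --name data-master \
    cloudsuite/data-analytics --master --master-ip=10.0.0.10
```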

Start any number of Hadoop slaves with:
```
Start any number of Hadoop workers with:

```bash
$ # on VM1
$ docker run -d --net host --name data-slave01 cloudsuite/data-analytics --slave --master-ip=<IP_ADDRESS_MASTER>

@@ -46,27 +42,31 @@ $ docker run -d --net host --name data-slave02 cloudsuite/data-analytics --slave

...
```
**Note**: You should set `IP_ADDRESS_MASTER` to master's IP address.

After both master and slave are set up (you can use `docker logs` to observe if the log is still generating), run the benchmark with:
**Note**: You should set `IP_ADDRESS_MASTER` to the master's IP address and make sure that address is accessible from each worker.

After both master and worker are set up (you can use `docker logs` to observe if the log is still being updated), run the benchmark with the following command:

```bash
$ docker exec data-master benchmark
```
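
While the job runs, one way to keep an eye on progress (assuming the container name `data-master` used above) is to follow the master's log output:

```bash
# Follow the master container's log; Ctrl-C stops following without stopping the container
$ docker logs -f data-master
```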

### Configuring Hadoop parameters ###

We can configure a few parameters for Hadoop depending on requirements.
A few parameters for Hadoop can be configured depending on requirements.

Hadoop infers the number of workers with how many partitions it created with HDFS. We can increase or reduce the HDFS partition size to `N` mb with `--hdfs-block-size=N`, 128mb being the default. The current dataset included here weights 900MB, thus the default `--hdfs-block-size=128` of 128mb resulting in splits between 1 and 8 parts depending on the benchmark phase.
Hadoop infers the number of workers from how many partitions it created with HDFS (the Hadoop Distributed File System, which distributes dataset chunks to the workers). You can increase or reduce the HDFS partition size to `N` MB with `--hdfs-block-size=N`, with 128MB being the default. The default dataset weighs 900MB. Thus, depending on the benchmark phase (sequencing, vectorization, pre-training, training, and inference), the default option `--hdfs-block-size=128` results in a split between 1 and 8 parts.

The maximum number of workers is configured by `--yarn-cores=C`, default is 8, if there's more splits than number of workers, YARN will only allow up to `C` workers threads to process them and multiplex the tasks. Please note that **at least 2 cores** should be given for all workers in total: One core for the map operation and another core for the reduce operation. Otherwise, the process can get stuck.
Hadoop relies on [YARN][yarn] (Yet Another Resource Negotiator) to manage its resources, and the maximum number of workers is configured by `--yarn-cores=C`, whose default value is 8. If there are more blocks than the number of workers, YARN will only allow up to `C` worker threads to process them. Please note that **at least two cores** should be given in total: One core for the map operation and another for the reduce operation. Otherwise, the process can get stuck.

The maximum memory used by each worker is configured by `--mapreduce-mem=N`, default is 2096mb. Note that depending on the number of `--yarn-cores=C`, the total actual physical memory required will be of at least `C*N`. You are recommended to allocate 8GB memory (even for single worker with 2 CPUs) in total to avoid out of memory errors.
The maximum memory used by each worker is configured by `--mapreduce-mem=N`, and the default value is 2096MB. Note that depending on the number of `--yarn-cores=C`, the total physical memory required will be at least `C*N`. To avoid out-of-memory errors, we recommend allocating at least 8GB of memory (even for a single worker with two cores) in total.

For increasing total number of workers, please use a bigger dataset from wikimedia. Using a smaller partition sizes than 128 mb will result in increasing number of workers but also will actually slowdown the execution due to overheads of small partition size.
To increase the number of workers, please use a bigger dataset from Wikimedia. Using partition sizes smaller than 128MB can increase the number of workers but slow down the execution due to overheads of the small partition size.
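
As an illustrative sketch only (assuming these flags are passed to the same container entrypoint as `--master` and `--master-ip`), a master sized for four worker threads of 2GB each could be started as:

```bash
# Sketch: 4 YARN cores x 2048MB per worker => budget at least 8GB of host memory
$ docker run -d --net host --volumes-from wikimedia-dataset --name data-master \
    cloudsuite/data-analytics --master \
    --hdfs-block-size=128 --yarn-cores=4 --mapreduce-mem=2048
```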


[dhrepo]: https://hub.docker.com/r/cloudsuite/data-analytics/ "DockerHub Page"
[dhpulls]: https://img.shields.io/docker/pulls/cloudsuite/data-analytics.svg "Go to DockerHub Page"
[dhstars]: https://img.shields.io/docker/stars/cloudsuite/data-analytics.svg "Go to DockerHub Page"
[yarn]: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html "YARN explanation"

[latestcontainer]: https://github.com/parsa-epfl/cloudsuite/blob/main/benchmarks/data-analytics/latest/Dockerfile "link to container, github"
