Skip to content

Commit 4f57b9b

Browse files
added graph_traversal app
1 parent f924f35 commit 4f57b9b

File tree

20 files changed

+924
-393
lines changed

20 files changed

+924
-393
lines changed

doc/app/graph_traversal/Makefile

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
tbb_include=/home/twhuang/PhD/Code/tbb/include/
2+
tbb_library=/home/twhuang/PhD/Code/tbb/build/linux_intel64_gcc_cc7_libc2.27_kernel4.15.0_release/
3+
taskflow_include=/home/twhuang/PhD/Code/cpp-taskflow/
4+
5+
default:
6+
g++ -c seq.cpp -O2 -I $(taskflow_include) -I $(tbb_include) -L $(tbb_library) -ltbb -lpthread -lrt -std=c++17
7+
g++ -c omp.cpp -O2 -I $(taskflow_include) -I $(tbb_include) -L $(tbb_library) -ltbb -lpthread -lrt -std=c++17 -fopenmp
8+
g++ -c taskflow.cpp -O2 -I $(taskflow_include) -I $(tbb_include) -L $(tbb_library) -ltbb -lpthread -lrt -std=c++17
9+
g++ -c tbb.cpp -O2 -I $(taskflow_include) -I $(tbb_include) -L $(tbb_library) -ltbb -lpthread -lrt -std=c++17
10+
g++ -o run main.cpp omp.o tbb.o taskflow.o -O2 -I $(taskflow_include) -I $(tbb_include) -L $(tbb_library) -ltbb -lpthread -lrt -std=c++17 -fopenmp
11+
12+
clean:
13+
rm -rf run *.o

doc/app/graph_traversal/dataset

Lines changed: 400 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 210 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,210 @@
1+
# Graph Traversal
2+
3+
In this page, we demonstrate a common graph traversal application
4+
and compares the performance between OpenMP, TBB, and Cpp-Taskflow.
5+
We consider a `NxM` level graph,
6+
where `N` and `M` are the height and width of the graph, respectively.
7+
Each node has at most four edges joining to the subsequent level.
8+
The figure below shows an example `3x4` level graph.
9+
The tuple `(x, y)` denotes the row and column indices of a node
10+
in a mapped data structure.
11+
12+
![](levelgraph_sample.png)
13+
14+
The goal is to write a parallel graph traversal algorithm
15+
following the edge dependency constraints.
16+
A node can be traversed only if its preceding nodes are all traversed.
17+
The procedure is fully asynchronous - no synchronization between levels.
18+
In practice, one would embed application-specific workloads in each node during the traversal.
19+
20+
We compare with three parallel implementations with Cpp-Taskflow, OpenMP,
21+
and Intel Thread Building Blocks (TBB).
22+
23+
+ [Cpp-Taskflow](#cpp-taskflow)
24+
+ [Intel Thread Building Blocks (TBB)](#intel-thread-building-blocks)
25+
+ [OpenMP](#openmp)
26+
+ [Debrief](#debrief)
27+
28+
# Cpp-Taskflow
29+
30+
The programming model of Cpp-Taskflow flows naturally with the graph structure.
31+
Each node corresponds to a task and each edge corresponds to a dependency between two tasks.
32+
33+
```cpp
34+
tf::Taskflow tf(std::thread::hardware_concurrency());
35+
36+
tasks.resize(graph.level());
37+
for(size_t i=0; i<tasks.size(); ++i) {
38+
tasks[i].resize(graph.length());
39+
}
40+
41+
for(size_t i=0; i<graph.length(); i++){
42+
Node& n = graph.node_at(graph.level()-1, i);
43+
tasks[graph.level()-1][i] = tf.silent_emplace([&](){ n.mark(); });
44+
}
45+
46+
for(int l=graph.level()-2; l>=0 ; l--){
47+
for(size_t i=0; i<graph.length(); i++){
48+
Node& n = graph.node_at(l, i);
49+
tasks[l][i] = tf.silent_emplace([&](){ n.mark();});
50+
for(size_t k=0; k<n._out_edges.size(); k++){
51+
tasks[l][i].precede(tasks[l+1][n._out_edges[k]]);
52+
}
53+
}
54+
}
55+
56+
tf.wait_for_all();
57+
```
58+
59+
The code of this implementation can be found at [taskflow.cpp](taskflow.cpp).
60+
61+
62+
# Intel Thread Building Blocks
63+
64+
The TBB-based implementation is similar to Cpp-Taskflow
65+
except we build the task dependency graph using the `continue_node` type
66+
and the `make_edge` function defined in Intel TBB.
67+
68+
```cpp
69+
using namespace tbb;
70+
using namespace tbb::flow;
71+
tbb::task_scheduler_init init(std::thread::hardware_concurrency());
72+
73+
tbb::flow::graph G;
74+
75+
tasks.resize(graph.level());
76+
for(size_t i=0; i<tasks.size(); ++i) {
77+
tasks[i].resize(graph.length());
78+
}
79+
80+
for(size_t i=0; i<graph.length(); i++){
81+
Node& n = graph.node_at(graph.level()-1, i);
82+
tasks[graph.level()-1][i] = std::make_unique<continue_node<continue_msg>>(G,
83+
[&](const continue_msg&){ n.mark(); }
84+
);
85+
}
86+
87+
for(int l=graph.level()-2; l>=0 ; l--){
88+
for(size_t i=0; i<graph.length(); i++){
89+
Node& n = graph.node_at(l, i);
90+
tasks[l][i] = std::make_unique<continue_node<continue_msg>>(G,
91+
[&](const continue_msg&){ n.mark(); }
92+
);
93+
for(size_t k=0; k<n._out_edges.size(); k++){
94+
make_edge(*tasks[l][i], *tasks[l+1][n._out_edges[k]]);
95+
}
96+
}
97+
}
98+
99+
source = std::make_unique<continue_node<continue_msg>>(G,
100+
[](const continue_msg&){}
101+
);
102+
103+
for(int l=0; l>=0 ; l--) {
104+
for(size_t i=0; i<graph.length(); i++){
105+
Node& n = graph.node_at(l, i);
106+
make_edge(*source, *tasks[l][i]);
107+
}
108+
}
109+
110+
source->try_put(continue_msg());
111+
G.wait_for_all();
112+
```
113+
114+
The code of this implementation can be found at [tbb.cpp](tbb.cpp).
115+
116+
117+
# OpenMP
118+
119+
The OpenMP-based implementation is a bit tricky because the task dependency clause
120+
is *static*.
121+
In order to capture the task dependency of a node,
122+
we need an additional integer vector `out`
123+
to explicitly specify its input and output constraints.
124+
Since the maximum degree of a node is given,
125+
we can work around all possible combinations of input and output degrees.
126+
127+
```cpp
128+
#pragma omp parallel
129+
{
130+
#pragma omp single
131+
{
132+
for(size_t l=0; l<graph.level(); l++){
133+
for(int i=0; i<graph.length(); i++){
134+
Node& n = graph.node_at(l, i);
135+
size_t out_edge_num = n._out_edges.size();
136+
size_t in_edge_num = n._in_edges.size();
137+
138+
switch(in_edge_num){
139+
140+
case(0):{
141+
142+
switch(out_edge_num){
143+
144+
case(1):{
145+
int* out0 = n.edge_ptr(0);
146+
#pragma omp task depend(out: out0[0]) shared(n)
147+
{ n.mark(); }
148+
break;
149+
}
150+
//.....................................
151+
//...........16 switch cases...........
152+
//.....................................
153+
}
154+
}
155+
}
156+
}
157+
```
158+
159+
The code of this implementation can be found at [omp.cpp](omp.cpp).
160+
161+
162+
163+
# Debrief
164+
165+
We evaluated our implementations on a
166+
Linux Ubuntu machine of 4 Intel CPUs 3.2GHz and 24 GB memory.
167+
168+
## Performance
169+
170+
The figure below shows the development cost (measured by [SLOCCount][SLOCCount])
171+
and runtime scalability (versus different graph sizes) of each implementation.
172+
173+
174+
![](performance.png)
175+
176+
We observe both Cpp-Taskflow and TBB are consistently faster than OpenMP.
177+
As the graph size increases, the performance gap between Cpp-Taskflow and TBB
178+
becomes pronounced.
179+
180+
181+
## Software Cost
182+
183+
We use the famous Linux tool [SLOCCount][SLOCCount] to measure the software cost of
184+
each implementation.
185+
The cost model is based on the *constructive cost model* (COCOMO).
186+
In the table below, **SLOC** denotes souce lines of code,
187+
**Dev Effort** denotes development effort estimate (person-months),
188+
**Sched Estimate** denotes schedule estimate (years),
189+
**Developers** denotes the estimate number of developers,
190+
**Dev Cost** denotes total estimated cost to develop.
191+
All quantities are better with fewer values.
192+
193+
| Task Model | SLOC | Dev Effort | Sched Estimate | Developers | Dev Cost |
194+
| :----------: | :--: | :--------: | :------------: | :--------: | :------: |
195+
| Cpp-Taskflow | 40 | 0.01 | 0.08 | 0.08 | $920 |
196+
| Intel TBB | 60 | 0.01 | 0.09 | 0.11 | $1,408 |
197+
| OpenMP 4.5 | 216 | 0.04 | 0.16 | 0.25 | $5,326 |
198+
199+
In terms of software cost, Cpp-Taskflow has the least amount of source lines of code
200+
(40 lines) over TBB (60 lines) and OpenMP (213 lines).
201+
The development cost of Cpp-Taskflow reported by [SLOCCount][SLOCCount]
202+
is about 1.5x and 5.7x fewer than TBB and OpenMP, respectively.
203+
204+
205+
206+
* * *
207+
208+
[GraphvizOnline]: https://dreampuf.github.io/GraphvizOnline/
209+
[SLOCCount]: https://dwheeler.com/sloccount/
210+
Lines changed: 18 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -10,9 +10,9 @@
1010
#include <cmath>
1111
#include <memory>
1212

13-
#include "taskflow.hpp"
14-
#include "tbb/task_scheduler_init.h"
15-
#include "tbb/flow_graph.h"
13+
#include <taskflow/taskflow.hpp>
14+
#include <tbb/task_scheduler_init.h>
15+
#include <tbb/flow_graph.h>
1616

1717
class Node{
1818

@@ -24,17 +24,17 @@ class Node{
2424
_out_edges = std::move(next_level_nodes);
2525
}
2626

27-
2827
void mark(){
2928
_visited = true;
30-
//std::cout << "Level:\t" << _level << "\tidx\t" << _idx<< std::endl;
3129
}
3230

3331
void unmark(){
3432
_visited = false;
3533
}
3634

37-
bool check_status(){ return _visited; }
35+
bool check_status() {
36+
return _visited;
37+
}
3838

3939
void print_node(){
4040

@@ -55,19 +55,12 @@ class Node{
5555

5656
}
5757

58-
59-
60-
const int index(){ return _idx; }
58+
int index() const { return _idx; }
6159

6260
int* edge_ptr(int edge_idx) { return &_out_edges[edge_idx]; }
6361

6462
std::vector<std::pair<int, int>> _in_edges;
6563
std::vector<int> _out_edges;
66-
tf::Task _task;
67-
68-
//tbb::flow::continue_node<tbb::flow::continue_msg> *tbb_node;
69-
70-
std::unique_ptr<tbb::flow::continue_node<tbb::flow::continue_msg>> tbb_node;
7164

7265
private:
7366

@@ -77,17 +70,16 @@ class Node{
7770
bool _visited {false};
7871
};
7972

80-
class RegularGraph{
73+
class LevelGraph {
8174

8275
public:
8376

84-
RegularGraph(size_t length, size_t level){
77+
LevelGraph(size_t length, size_t level){
8578

8679
_level_num = level;
8780
_length_num = length;
8881

89-
//std::random_device rd;
90-
std::mt19937 g(0);
82+
std::mt19937 g(0); // fixed the seed
9183

9284
for(size_t l=0; l<level; ++l){
9385

@@ -182,7 +174,6 @@ class RegularGraph{
182174
_graph[l][i].unmark();
183175
}
184176
}
185-
std::cout << "clear graph completed" << std::endl;
186177
}
187178

188179

@@ -202,14 +193,6 @@ class RegularGraph{
202193
return size;
203194
}
204195

205-
void reset_tbb_node(){
206-
for(size_t l=0; l<_level_num; l++){
207-
for(size_t i=0; i<_length_num; i++){
208-
_graph[l][i].tbb_node.reset(nullptr); //invoker destructor through reset
209-
}
210-
}
211-
}
212-
213196
private:
214197

215198
const size_t _edge_max = 4;
@@ -220,4 +203,12 @@ class RegularGraph{
220203

221204
};
222205

206+
std::chrono::microseconds measure_time_taskflow(LevelGraph&);
207+
std::chrono::microseconds measure_time_omp(LevelGraph&);
208+
std::chrono::microseconds measure_time_tbb(LevelGraph&);
209+
210+
211+
212+
213+
223214

File renamed without changes.

doc/app/graph_traversal/main.cpp

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,33 @@
1+
#include "levelgraph.hpp"
2+
3+
int main() {
4+
5+
double omp_time {0.0};
6+
double tbb_time {0.0};
7+
double tf_time {0.0};
8+
int rounds {10};
9+
10+
for(int i=1; i<=400; i++){
11+
12+
LevelGraph graph(i, i);
13+
14+
for(int j=0; j<rounds; ++j) {
15+
omp_time += measure_time_omp(graph).count();
16+
graph.clear_graph();
17+
18+
tbb_time += measure_time_tbb(graph).count();
19+
graph.clear_graph();
20+
21+
tf_time += measure_time_taskflow(graph).count();
22+
graph.clear_graph();
23+
}
24+
25+
std::cout << std::setw(12) << graph.graph_size()
26+
<< std::setw(12) << omp_time / rounds / 1000.0
27+
<< std::setw(12) << tbb_time / rounds / 1000.0
28+
<< std::setw(12) << tf_time / rounds / 1000.0
29+
<< std::endl;
30+
}
31+
}
32+
33+

0 commit comments

Comments
 (0)