<!-- HTML header for doxygen 1.8.13-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.14"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<title>Cpp-Taskflow</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<link rel="icon" type="image/x-icon" href="favicon.ico" />
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="navtree.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="resize.js"></script>
<script type="text/javascript" src="navtreedata.js"></script>
<script type="text/javascript" src="navtree.js"></script>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&dn=gpl-2.0.txt GPL-v2 */
$(document).ready(initResizable);
/* @license-end */</script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname"><a href="https://github.com/cpp-taskflow/cpp-taskflow">Cpp-Taskflow</a>
 <span id="projectnumber">2.4-master-branch</span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.14 -->
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&dn=gpl-2.0.txt GPL-v2 */
var searchBox = new SearchBox("searchBox", "search",false,'Search');
/* @license-end */
</script>
<script type="text/javascript" src="menudata.js"></script>
<script type="text/javascript" src="menu.js"></script>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&dn=gpl-2.0.txt GPL-v2 */
$(function() {
initMenu('',true,false,'search.php','Search');
$(document).ready(function() { init_search(); });
});
/* @license-end */</script>
<div id="main-nav"></div>
</div><!-- top -->
<div id="side-nav" class="ui-resizable side-nav-resizable">
<div id="nav-tree">
<div id="nav-tree-contents">
<div id="nav-sync" class="sync"></div>
</div>
</div>
<div id="splitbar" style="-moz-user-select:none;"
class="ui-resizable-handle">
</div>
</div>
<script type="text/javascript">
/* @license magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&dn=gpl-2.0.txt GPL-v2 */
$(document).ready(function(){initNavTree('chapter6.html','');});
/* @license-end */
</script>
<div id="doc-content">
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">C6: CPU-GPU Tasking </div> </div>
</div><!--header-->
<div class="contents">
<div class="textblock"><p>Modern scientific computing typically leverages GPU-powered parallel processing cores to speed up large-scale applications. This chapters discusses how to implement heterogeneous decomposition algorithms using CPU-GPU collaborative tasking.</p>
<h1><a class="anchor" id="C6_Create_a_cudaFlow"></a>
Create a cudaFlow</h1>
<p>Cpp-Taskflow enables concurrent CPU-GPU tasking by leveraging <a href="https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html">CUDA Graph</a>. The tasking interface is referred to as <em>cudaFlow</em>. A cudaFlow is a graph object of type <a class="el" href="classtf_1_1cudaFlow.html" title="methods for building a CUDA task dependency graph. ">tf::cudaFlow</a> created at runtime, similar to dynamic tasking. It manages a task node in a taskflow and associates it with a CUDA Graph. To create a cudaFlow, emplace a callable with an argument of type <a class="el" href="classtf_1_1cudaFlow.html" title="methods for building a CUDA task dependency graph. ">tf::cudaFlow</a>&. The following example implements the canonical saxpy (single-precision A·X Plus Y) task graph.</p>
<div class="fragment"><div class="line"> 1: #include <taskflow/taskflow.hpp></div><div class="line"> 2: </div><div class="line"> 3: <span class="comment">// saxpy (single-precision A·X Plus Y) kernel</span></div><div class="line"> 4: __global__ <span class="keywordtype">void</span> saxpy(<span class="keywordtype">int</span> n, <span class="keywordtype">float</span> a, <span class="keywordtype">float</span> *x, <span class="keywordtype">float</span> *y) {</div><div class="line"> 5: <span class="keywordtype">int</span> i = blockIdx.x*blockDim.x + threadIdx.x;</div><div class="line"> 6: <span class="keywordflow">if</span> (i < n) {</div><div class="line"> 7: y[i] = a*x[i] + y[i];</div><div class="line"> 8: }</div><div class="line"> 9: }</div><div class="line">10:</div><div class="line">11: <span class="comment">// main function begins</span></div><div class="line">12: <span class="keywordtype">int</span> main() {</div><div class="line">13:</div><div class="line">14: <a class="code" href="classtf_1_1Taskflow.html">tf::Taskflow</a> taskflow;</div><div class="line">15: <a class="code" href="classtf_1_1Executor.html">tf::Executor</a> executor;</div><div class="line">16: </div><div class="line">17: <span class="keyword">const</span> <span class="keywordtype">unsigned</span> N = 1<<20; <span class="comment">// size of the vector</span></div><div class="line">18:</div><div class="line">19: <a class="codeRef" doxygen="/Users/twhuang/PhD/Code/cpp-taskflow/doxygen/cppreference-doxygen-web.tag.xml:http://en.cppreference.com/w/" href="http://en.cppreference.com/w/cpp/container/vector.html">std::vector<float></a> hx(N, 1.0f); <span class="comment">// x vector at host</span></div><div class="line">20: <a class="codeRef" doxygen="/Users/twhuang/PhD/Code/cpp-taskflow/doxygen/cppreference-doxygen-web.tag.xml:http://en.cppreference.com/w/" href="http://en.cppreference.com/w/cpp/container/vector.html">std::vector<float></a> hy(N, 2.0f); <span class="comment">// y vector at host</span></div><div class="line">21:</div><div class="line">22: <span class="keywordtype">float</span> *dx{<span class="keyword">nullptr</span>}; <span class="comment">// x vector at device</span></div><div class="line">23: <span class="keywordtype">float</span> *dy{<span class="keyword">nullptr</span>}; <span class="comment">// y vector at device</span></div><div class="line">24: </div><div class="line">25: <a class="code" href="classtf_1_1Task.html">tf::Task</a> allocate_x = taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>(</div><div class="line">26: [&](){ cudaMalloc(&dx, N*<span class="keyword">sizeof</span>(<span class="keywordtype">float</span>));}</div><div class="line">27: );</div><div class="line">28:</div><div class="line">29: <a class="code" href="classtf_1_1Task.html">tf::Task</a> allocate_y = taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>(</div><div class="line">30: [&](){ cudaMalloc(&dy, N*<span class="keyword">sizeof</span>(<span class="keywordtype">float</span>));}</div><div class="line">31: );</div><div class="line">32:</div><div class="line">33: <a class="code" href="classtf_1_1Task.html">tf::Task</a> cudaflow = taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf) {</div><div class="line">34: <span class="comment">// create data transfer tasks</span></div><div class="line">35: <a 
class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> h2d_x = cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(dx, hx.data(), N); <span class="comment">// host-to-device x data transfer</span></div><div class="line">36: <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> h2d_y = cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(dy, hy.data(), N); <span class="comment">// host-to-device y data transfer</span></div><div class="line">37: <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> d2h_x = cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(hx.data(), dx, N); <span class="comment">// device-to-host x data transfer</span></div><div class="line">38: <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> d2h_y = cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(hy.data(), dy, N); <span class="comment">// device-to-host y data transfer</span></div><div class="line">39:</div><div class="line">40: <span class="comment">// launch saxpy<<<(N+255)/256, 256, 0>>>(N, 2.0f, dx, dy)</span></div><div class="line">41: <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> kernel = cf.<a class="code" href="classtf_1_1cudaFlow.html#adb731be71bdd436dfb5e36e6213a9a17">kernel</a>((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);</div><div class="line">42:</div><div class="line">43: kernel.<a class="code" href="classtf_1_1cudaTask.html#a4a9ca1a34bac47e4c9b04eb4fb2f7775">succeed</a>(h2d_x, h2d_y)</div><div class="line">44: .<a class="code" href="classtf_1_1cudaTask.html#abdd68287ec4dff4216af34d1db44d1b4">precede</a>(d2h_x, d2h_y);</div><div class="line">45: });</div><div class="line">46: cudaflow.<a class="code" href="classtf_1_1Task.html#a331b1b726555072e7c7d10941257f664">succeed</a>(allocate_x, allocate_y); <span class="comment">// overlap data allocations</span></div><div class="line">47: </div><div class="line">48: executor.<a class="code" href="classtf_1_1Executor.html#a81f35d5b0a20ac0646447eb80d97c0aa">run</a>(taskflow).wait();</div><div class="line">49:</div><div class="line">50: taskflow.<a class="code" href="classtf_1_1Taskflow.html#ac433018262e44b12c4cc9f0c4748d758">dump</a>(<a class="codeRef" doxygen="/Users/twhuang/PhD/Code/cpp-taskflow/doxygen/cppreference-doxygen-web.tag.xml:http://en.cppreference.com/w/" href="http://en.cppreference.com/w/cpp/io/basic_ostream.html">std::cout</a>); <span class="comment">// dump the taskflow</span></div><div class="line">51: }</div></div><!-- fragment --><div class="image">
<object type="image/svg+xml" data="saxpy.svg" width="50%">saxpy.svg</object>
</div>
<p>Debrief:</p>
<ul>
<li>Lines 3-9 define a saxpy kernel using CUDA </li>
<li>Lines 19-20 declare two host vectors, <code>hx</code> and <code>hy</code> </li>
<li>Lines 22-23 declare two device vector pointers, <code>dx</code> and <code>dy</code> </li>
<li>Lines 25-31 declare two tasks to allocate memory for <code>dx</code> and <code>dy</code> on device, each of <code>N*sizeof(float)</code> bytes </li>
<li>Lines 33-45 create a cudaFlow to capture kernel work in a graph (two host-to-device data transfer tasks, one saxpy kernel task, and two device-to-host data transfer tasks) </li>
<li>Lines 46-48 define the task dependencies between the host tasks and the cudaFlow task and execute the taskflow (see the verification sketch after this list)</li>
</ul>
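<p>After <code>executor.run(taskflow).wait()</code> returns, the device-to-host copies have completed and you can verify the result on the host, just as the multi-GPU example later in this chapter does. Below is a minimal sketch, not part of the original example, assuming <code><algorithm></code> and <code><cmath></code> are included; with a = 2.0f, x = 1.0f, and y = 2.0f, each <code>hy[i]</code> should equal 4.0f.</p>
<div class="fragment"><div class="line">// host-side verification after executor.run(taskflow).wait()</div><div class="line">float max_error = 0.0f;</div><div class="line">for(size_t i=0; i<N; ++i) {</div><div class="line">  max_error = std::max(max_error, std::abs(hy[i] - 4.0f));</div><div class="line">}</div><div class="line">std::cout << "saxpy finished with max error: " << max_error << '\n';</div><div class="line"></div><div class="line">// release the device memory allocated by allocate_x and allocate_y</div><div class="line">cudaFree(dx);</div><div class="line">cudaFree(dy);</div></div><!-- fragment -->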
<p>Cpp-Taskflow does not expend unnecessary effort on kernel programming but focuses on scheduling CUDA operations together with CPU work. We give users full freedom to craft CUDA kernels commensurate with their domain knowledge. Users focus on developing high-performance kernels using the native CUDA toolkit, while leaving the difficult task parallelism to Cpp-Taskflow.</p>
<h1><a class="anchor" id="C6_Compile_a_cudaFlow_program"></a>
Compile a cudaFlow Program</h1>
<p>Use <a href="https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html">nvcc</a> (at least v10) to compile a cudaFlow program:</p>
<div class="fragment"><div class="line">~$ nvcc my_cudaflow.cu -I path/to/include/taskflow -O2 -o my_cudaflow</div><div class="line">~$ ./my_cudaflow</div></div><!-- fragment --><p>Our source autonomously enables cudaFlow when detecting a CUDA compiler.</p>
<h1><a class="anchor" id="C6_configure_the_number_of_gpu_workers"></a>
Configure the Number of GPU Workers</h1>
<p>By default, the executor spawns one worker per GPU. We dedicate a worker set to each heterogeneous domain, for example, the host domain and the CUDA domain. If your system has 4 CPU cores and 2 GPUs, the default number of workers spawned by the executor is 4+2, where 4 workers run CPU tasks and 2 workers run GPU tasks (cudaFlows). You can construct an executor with a different number of GPU workers.</p>
<div class="fragment"><div class="line"><a class="code" href="classtf_1_1Executor.html">tf::Executor</a> executor(17, 8); <span class="comment">// 17 CPU workers and 8 GPU workers</span></div></div><!-- fragment --><p>The above executor spawns 17 and 8 workers for running CPU and GPU tasks, respectively. These workers coordinate with each other to balance the load in a work-stealing loop highly optimized for performance.</p>
<h1><a class="anchor" id="C6_run_a_cudaflow_on_multiple_gpus"></a>
Run a cudaFlow on Multiple GPUs</h1>
<p>You can run a cudaFlow on multiple GPUs by explicitly associating a cudaFlow or a kernel task with a CUDA device. A CUDA device is an integer in the range <code>[0, N)</code> identifying a GPU, where <code>N</code> is the number of GPUs in the system. The code below creates a cudaFlow that runs on GPU device 2 through <code>my_stream</code>.</p>
<div class="fragment"><div class="line">taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf) {</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#ad8c0664e4dc3748f043eaa31b69c11cc">device</a>(2);</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#a5ccc24918db4d63c50f26b68d17fd452">stream</a>(my_stream); <span class="comment">// by default, a cudaFlow runs on a per-worker stream managed by the executor</span></div><div class="line"> <span class="comment">// adding more cudaTasks below (all tasks are placed on GPU 2 unless specified explicitly)</span></div><div class="line">});</div></div><!-- fragment --><p>You can place a kernel on a device explicitly through the method <a class="el" href="classtf_1_1cudaFlow.html#a4a839dbaa01237a440edfebe8faf4e5b" title="creates a kernel task on a device ">tf::cudaFlow::kernel_on</a> that takes the device identifier in the first argument.</p>
<div class="fragment"><div class="line"> 1: #include <taskflow/taskflow.hpp></div><div class="line"> 2: </div><div class="line"> 3: <span class="comment">// saxpy (single-precision A·X Plus Y) kernel</span></div><div class="line"> 4: __global__ <span class="keywordtype">void</span> saxpy(<span class="keywordtype">int</span> n, <span class="keywordtype">int</span> a, <span class="keywordtype">int</span> *x, <span class="keywordtype">int</span> *y, <span class="keywordtype">int</span> *z) {</div><div class="line"> 5: <span class="keywordtype">int</span> i = blockIdx.x*blockDim.x + threadIdx.x;</div><div class="line"> 6: <span class="keywordflow">if</span> (i < n) {</div><div class="line"> 7: z[i] = a*x[i] + y[i];</div><div class="line"> 8: }</div><div class="line"> 9: }</div><div class="line">10:</div><div class="line">11: <span class="keywordtype">int</span> main() {</div><div class="line">12:</div><div class="line">13: <span class="keyword">const</span> <span class="keywordtype">unsigned</span> N = 1<<20;</div><div class="line">14: </div><div class="line">15: <span class="keywordtype">int</span>* dx {<span class="keyword">nullptr</span>};</div><div class="line">16: <span class="keywordtype">int</span>* dy {<span class="keyword">nullptr</span>};</div><div class="line">17: <span class="keywordtype">int</span>* z1 {<span class="keyword">nullptr</span>};</div><div class="line">18: <span class="keywordtype">int</span>* z2 {<span class="keyword">nullptr</span>};</div><div class="line">19: </div><div class="line">20: cudaMallocManaged(&dx, N*<span class="keyword">sizeof</span>(<span class="keywordtype">int</span>)); <span class="comment">// create a unified memory block for x</span></div><div class="line">21: cudaMallocManaged(&dy, N*<span class="keyword">sizeof</span>(<span class="keywordtype">int</span>)); <span class="comment">// create a unified memory block for y</span></div><div class="line">22: cudaMallocManaged(&z1, N*<span class="keyword">sizeof</span>(<span class="keywordtype">int</span>)); <span class="comment">// result of saxpy task 1</span></div><div class="line">23: cudaMallocManaged(&z2, N*<span class="keyword">sizeof</span>(<span class="keywordtype">int</span>)); <span class="comment">// result of saxpy task 2</span></div><div class="line">24: </div><div class="line">25: <span class="keywordflow">for</span>(<span class="keywordtype">unsigned</span> i=0; i<N; ++i) {</div><div class="line">26: dx[i] = 1;</div><div class="line">27: dy[i] = 2;</div><div class="line">28: }</div><div class="line">29:</div><div class="line">30: <a class="code" href="classtf_1_1Taskflow.html">tf::Taskflow</a> taskflow;</div><div class="line">31: <a class="code" href="classtf_1_1Executor.html">tf::Executor</a> executor;</div><div class="line">32: </div><div class="line">33: taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf){</div><div class="line">34: <span class="comment">// launch the cudaFlow on GPU 0</span></div><div class="line">35: cf.<a class="code" href="classtf_1_1cudaFlow.html#ad8c0664e4dc3748f043eaa31b69c11cc">device</a>(0);</div><div class="line">36:</div><div class="line">37: <span class="comment">// launch the first saxpy kernel on GPU 1</span></div><div class="line">38: cf.<a class="code" href="classtf_1_1cudaFlow.html#a4a839dbaa01237a440edfebe8faf4e5b">kernel_on</a>(1, (N+255)/256, 256, 0, saxpy, N, 2, dx, dy, z1);</div><div class="line">39:</div><div 
class="line">40: <span class="comment">// launch the second saxpy kernel on GPU 3</span></div><div class="line">41: cf.<a class="code" href="classtf_1_1cudaFlow.html#a4a839dbaa01237a440edfebe8faf4e5b">kernel_on</a>(3, (N+255)/256, 256, 0, saxpy, N, 2, dx, dy, z2);</div><div class="line">42: });</div><div class="line">43:</div><div class="line">44: executor.<a class="code" href="classtf_1_1Executor.html#a81f35d5b0a20ac0646447eb80d97c0aa">run</a>(taskflow).wait();</div><div class="line">45:</div><div class="line">46: cudaFree(dx);</div><div class="line">47: cudaFree(dy);</div><div class="line">48: </div><div class="line">49: <span class="comment">// verify the solution; max_error should be zero</span></div><div class="line">50: <span class="keywordtype">int</span> max_error = 0;</div><div class="line">51: <span class="keywordflow">for</span> (<span class="keywordtype">size_t</span> i = 0; i < N; i++) {</div><div class="line">52: max_error = <a class="codeRef" doxygen="/Users/twhuang/PhD/Code/cpp-taskflow/doxygen/cppreference-doxygen-web.tag.xml:http://en.cppreference.com/w/" href="http://en.cppreference.com/w/cpp/algorithm/max.html">std::max</a>(max_error, abs(z1[i]-4));</div><div class="line">53: max_error = <a class="codeRef" doxygen="/Users/twhuang/PhD/Code/cpp-taskflow/doxygen/cppreference-doxygen-web.tag.xml:http://en.cppreference.com/w/" href="http://en.cppreference.com/w/cpp/algorithm/max.html">std::max</a>(max_error, abs(z2[i]-4));</div><div class="line">54: }</div><div class="line">55: <a class="codeRef" doxygen="/Users/twhuang/PhD/Code/cpp-taskflow/doxygen/cppreference-doxygen-web.tag.xml:http://en.cppreference.com/w/" href="http://en.cppreference.com/w/cpp/io/basic_ostream.html">std::cout</a> << <span class="stringliteral">"saxpy finished with max error: "</span> << max_error << <span class="charliteral">'\n'</span>;</div><div class="line">56: }</div></div><!-- fragment --><p>Debrief:</p>
<ul>
<li>Lines 3-9 define a CUDA saxpy kernel that stores the result in <code>z</code> <br />
</li>
<li>Lines 15-23 declare four unified memory blocks accessible from any processor </li>
<li>Lines 25-28 initialize the <code>dx</code> and <code>dy</code> blocks on the CPU </li>
<li>Lines 33-42 create a cudaFlow task </li>
<li>Lines 34-35 associate the cudaFlow with GPU 0 </li>
<li>Lines 37-38 create a kernel task to launch the first saxpy on GPU 1 and store the result in <code>z1</code> </li>
<li>Lines 40-41 create a kernel task to launch the second saxpy on GPU 3 and store the result in <code>z2</code> </li>
<li>Lines 44-55 run the taskflow and verify the result (<code>max_error</code> should be zero)</li>
</ul>
<p>Running the program gives the following <a href="https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html">nvidia-smi</a> snapshot on a system with 4 GPUs:</p>
<div class="fragment"><div class="line">+-----------------------------------------------------------------------------+</div><div class="line">| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |</div><div class="line">|-------------------------------+----------------------+----------------------+</div><div class="line">| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |</div><div class="line">| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |</div><div class="line">|===============================+======================+======================|</div><div class="line">| 0 GeForce RTX 208... Off | 00000000:18:00.0 Off | N/A |</div><div class="line">| 32% 35C P2 68W / 250W | 163MiB / 11019MiB | 0% Default |</div><div class="line">+-------------------------------+----------------------+----------------------+</div><div class="line">| 1 GeForce RTX 208... Off | 00000000:3B:00.0 Off | N/A |</div><div class="line">| 33% 43C P2 247W / 250W | 293MiB / 11019MiB | 100% Default |</div><div class="line">+-------------------------------+----------------------+----------------------+</div><div class="line">| 2 GeForce RTX 208... Off | 00000000:86:00.0 Off | N/A |</div><div class="line">| 32% 37C P0 72W / 250W | 10MiB / 11019MiB | 0% Default |</div><div class="line">+-------------------------------+----------------------+----------------------+</div><div class="line">| 3 GeForce RTX 208... Off | 00000000:AF:00.0 Off | N/A |</div><div class="line">| 31% 43C P2 245W / 250W | 293MiB / 11019MiB | 100% Default |</div><div class="line">+-------------------------------+----------------------+----------------------+</div><div class="line"> </div><div class="line">+-----------------------------------------------------------------------------+</div><div class="line">| Processes: GPU Memory |</div><div class="line">| GPU PID Type Process name Usage |</div><div class="line">|=============================================================================|</div><div class="line">| 0 53869 C ./a.out 153MiB |</div><div class="line">| 1 53869 C ./a.out 155MiB |</div><div class="line">| 3 53869 C ./a.out 155MiB |</div><div class="line">+-----------------------------------------------------------------------------+</div></div><!-- fragment --><p>Even if cudaFlow provides interface for device placement, it is your responsibility to ensure correct memory access. For example, you may not allocate a memory block on GPU 2 using <code>cudaMalloc</code> and access it from a kernel on GPU 1. A safe practice is to allocate unified memory blocks using <code>cudaMallocManaged</code> and let the CUDA runtime perform automatic memory migration between processors (as demonstrated in the code example above).</p>
<p>For the same example, you may create two cudaFlows, one for each kernel on its own GPU. The overhead of creating a kernel on the same device as its cudaFlow is much lower than creating it on a different device.</p>
<div class="fragment"><div class="line"><a class="code" href="classtf_1_1Task.html">tf::Task</a> cudaFlow_on_gpu1 = taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf){</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#ad8c0664e4dc3748f043eaa31b69c11cc">device</a>(1);</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#adb731be71bdd436dfb5e36e6213a9a17">kernel</a>((N+255)/256, 256, 0, saxpy, N, 2, dx, dy, z1);</div><div class="line">});</div><div class="line"></div><div class="line"><a class="code" href="classtf_1_1Task.html">tf::Task</a> cudaFlow_on_gpu3 = taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf){</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#ad8c0664e4dc3748f043eaa31b69c11cc">device</a>(3);</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#adb731be71bdd436dfb5e36e6213a9a17">kernel</a>((N+255)/256, 256, 0, saxpy, N, 2, dx, dy, z2);</div><div class="line">});</div></div><!-- fragment --><h1><a class="anchor" id="C6_GPUMemoryOperations"></a>
GPU Memory Operations</h1>
<p><a class="el" href="classtf_1_1cudaFlow.html" title="methods for building a CUDA task dependency graph. ">cudaFlow</a> provides a set of methods for users to manipulate device memory data. There are two categories, raw data and typed data. Raw data operations are methods with prefix <code>mem</code>, such as <code>memcpy</code> and <code>memset</code>, that take action on a device memory area in <em>bytes</em>. Typed data operations such as <code>copy</code>, <code>fill</code>, and <code>zero</code>, take <em>logical count</em> of elements. For instance, the following three methods have the same result of zeroing <code>sizeof(int)*count</code> bytes of the device memory area pointed by <code>target</code>.</p>
<div class="fragment"><div class="line"><span class="keywordtype">int</span>* target;</div><div class="line">cudaMalloc(&target, count*<span class="keyword">sizeof</span>(<span class="keywordtype">int</span>));</div><div class="line"></div><div class="line">taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf){</div><div class="line"> <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> memset_target = cf.<a class="code" href="classtf_1_1cudaFlow.html#a079ca65da35301e5aafd45878a19e9d2">memset</a>(target, 0, <span class="keyword">sizeof</span>(<span class="keywordtype">int</span>) * count);</div><div class="line"> <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> same_as_above = cf.<a class="code" href="classtf_1_1cudaFlow.html#aee1fa4aff12a41737ea585fa2e106a35">fill</a>(target, 0, count);</div><div class="line"> <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> same_as_above_again = cf.<a class="code" href="classtf_1_1cudaFlow.html#a91c1739bb9a2832f306f3d12693a0994">zero</a>(target, count);</div><div class="line">});</div></div><!-- fragment --><p>The method <a class="el" href="classtf_1_1cudaFlow.html#aee1fa4aff12a41737ea585fa2e106a35" title="creates a fill task that fills a typed memory block with a value ">cudaFlow::fill</a> is a more powerful version of <a class="el" href="classtf_1_1cudaFlow.html#a079ca65da35301e5aafd45878a19e9d2" title="creates a memset task ">cudaFlow::memset</a>. It can fill a memory area with any value of type <code>T</code>, given that <code>sizeof(T)</code> is 1, 2, or 4 bytes. For example, the following code sets each element in the array <code>target</code> to 1234.</p>
<div class="fragment"><div class="line">taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf){</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#aee1fa4aff12a41737ea585fa2e106a35">fill</a>(target, 1234, count);</div><div class="line">});</div></div><!-- fragment --><p>Similar concept applies to <a class="el" href="classtf_1_1cudaFlow.html#ad37637606f0643f360e9eda1f9a6e559" title="creates a memcpy task ">cudaFlow::memcpy</a> and <a class="el" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f" title="creates a copy task ">cudaFlow::copy</a> as well.</p>
<div class="fragment"><div class="line">taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf){</div><div class="line"> <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> memcpy_target = cf.<a class="code" href="classtf_1_1cudaFlow.html#ad37637606f0643f360e9eda1f9a6e559">memcpy</a>(target, source, <span class="keyword">sizeof</span>(<span class="keywordtype">int</span>) * count);</div><div class="line"> <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> same_as_above = cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(target, source, count);</div><div class="line">});</div></div><!-- fragment --><h1><a class="anchor" id="C6_LaunchcudaFlowRepeatedly"></a>
Iterate a cudaFlow</h1>
<p>You can create a cudaFlow once and launch it multiple times using <a class="el" href="classtf_1_1cudaFlow.html#a1eeebb4bbd6436a3145ff950ce282ac4" title="repeats the execution of the cudaFlow by n times ">cudaFlow::repeat</a> or <a class="el" href="classtf_1_1cudaFlow.html#adbd46a1ef9f5ae9e0848ccbefa1e65ee" title="assigns a predicate to loop the cudaFlow until the predicate is satisfied ">cudaFlow::predicate</a>, given that the graph parameters remain <em>unchanged</em> across all iterations.</p>
<div class="fragment"><div class="line">taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&] (<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf) {</div><div class="line"> <span class="comment">// construct the GPU task dependency graph ...</span></div><div class="line"> </div><div class="line"> <span class="comment">// launch the cudaFlow 10 times</span></div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#a1eeebb4bbd6436a3145ff950ce282ac4">repeat</a>(10);</div><div class="line"></div><div class="line"> <span class="comment">// equivalently</span></div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#adbd46a1ef9f5ae9e0848ccbefa1e65ee">predicate</a>([n=10] () <span class="keyword">mutable</span> { <span class="keywordflow">return</span> n-- == 0; });</div><div class="line">});</div></div><!-- fragment --><p>The executor iterate the execution of the cudaFlow until the predicate evaluates to <code>true</code>.</p>
<h1><a class="anchor" id="C6_Granularity"></a>
Granularity</h1>
<p>Creating a cudaFlow incurs a certain overhead, which means fine-grained tasking, such as one GPU operation per cudaFlow, may not give you any performance gain. You should aggregate as many GPU operations as possible into a cudaFlow and launch the entire graph once, instead of issuing separate calls. For example, the following code creates the saxpy task graph at a very fine-grained level, using one cudaFlow per GPU operation.</p>
<div class="fragment"><div class="line"><a class="code" href="classtf_1_1Task.html">tf::Task</a> h2d_x = taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf) {</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(dx, hx.data(), N);</div><div class="line">};</div><div class="line"></div><div class="line"><a class="code" href="classtf_1_1Task.html">tf::Task</a> h2d_y = taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf) {</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(dy, hy.data(), N);</div><div class="line">};</div><div class="line"></div><div class="line"><a class="code" href="classtf_1_1Task.html">tf::Task</a> d2h_x = taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf) {</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(hx.data(), dx, N);</div><div class="line">};</div><div class="line"></div><div class="line"><a class="code" href="classtf_1_1Task.html">tf::Task</a> d2h_y = taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf) {</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(hy.data(), dy, N);</div><div class="line">};</div><div class="line"></div><div class="line"><a class="code" href="classtf_1_1Task.html">tf::Task</a> kernel = taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf) {</div><div class="line"> cf.<a class="code" href="classtf_1_1cudaFlow.html#adb731be71bdd436dfb5e36e6213a9a17">kernel</a>((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);</div><div class="line">};</div><div class="line"></div><div class="line">kernel.<a class="code" href="classtf_1_1Task.html#a331b1b726555072e7c7d10941257f664">succeed</a>(h2d_x, h2d_y)</div><div class="line"> .<a class="code" href="classtf_1_1Task.html#a8c78c453295a553c1c016e4062da8588">precede</a>(d2h_x, d2h_y);</div></div><!-- fragment --><p>The following code aggregates the five GPU operations using one cudaFlow to deliver much better performance.</p>
<div class="fragment"><div class="line"><a class="code" href="classtf_1_1Task.html">tf::Task</a> cudaflow = taskflow.<a class="code" href="classtf_1_1FlowBuilder.html#a796e29175380f70246cf2a5639adc437">emplace</a>([&](<a class="code" href="classtf_1_1cudaFlow.html">tf::cudaFlow</a>& cf) {</div><div class="line"> <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> h2d_x = cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(dx, hx.data(), N);</div><div class="line"> <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> h2d_y = cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(dy, hy.data(), N);</div><div class="line"> <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> d2h_x = cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(hx.data(), dx, N);</div><div class="line"> <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> d2h_y = cf.<a class="code" href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f">copy</a>(hy.data(), dy, N);</div><div class="line"> <a class="code" href="classtf_1_1cudaTask.html">tf::cudaTask</a> kernel = cf.<a class="code" href="classtf_1_1cudaFlow.html#adb731be71bdd436dfb5e36e6213a9a17">kernel</a>((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy);</div><div class="line"> kernel.<a class="code" href="classtf_1_1cudaTask.html#a4a9ca1a34bac47e4c9b04eb4fb2f7775">succeed</a>(h2d_x, h2d_y)</div><div class="line"> .<a class="code" href="classtf_1_1cudaTask.html#abdd68287ec4dff4216af34d1db44d1b4">precede</a>(d2h_x, d2h_y);</div><div class="line">});</div></div><!-- fragment --><p>We encourage users to study and understand the parallel structure of their applications, in order to come up with the best granularity of task decomposition. A refined task graph can have significant performance difference from the raw counterpart. </p>
</div></div><!-- contents -->
</div><!-- doc-content -->
<!-- start footer part -->
<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
<ul>
<li class="navelem"><a class="el" href="Cookbook.html">Cookbook</a></li>
<li class="footer">Generated by
<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.14 </li>
</ul>
</div>
</body>
</html>