Learning CUDA
Notes from CUDA by Example:
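#define N 1024

__global__ void add(int *a, int *b, int *c){
    int tid = blockIdx.x;       // index when launched as add<<<N,1>>> (one thread per block)
    //int tid = threadIdx.x;    // use this instead when launched as add<<<1,N>>> (one block of N threads)
    if(tid < N){
        c[tid] = a[tid] + b[tid];
    }
}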
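When handing a vector addition like the one above to the device, how should the work be split into blocks and threads?
In my environment (Quadro K5000), the maximum number of blocks is 65535 and the maximum number of threads per block is 1024.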
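The limits quoted above can also be read from the device at run time. The following is a minimal sketch, assuming device 0 is the GPU of interest; the `LaunchLimits` struct and `query_launch_limits` are hypothetical names introduced here, while `cudaGetDeviceProperties` and the `cudaDeviceProp` fields are part of the CUDA runtime API. Note that the grid limit is reported per dimension, and on many devices the x dimension allows more blocks than y and z.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper type (not from the original notes).
struct LaunchLimits {
    int max_grid_x;             // maximum number of blocks along x
    int max_threads_per_block;  // maximum number of threads in one block
    bool ok;                    // false if the query failed
};

// Reads the launch limits of the given device and prints them.
LaunchLimits query_launch_limits(int device) {
    LaunchLimits limits = {0, 0, false};
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return limits;
    }
    limits.max_grid_x = prop.maxGridSize[0];
    limits.max_threads_per_block = prop.maxThreadsPerBlock;
    limits.ok = true;
    printf("%s\n", prop.name);
    printf("  max grid size (x, y, z): %d, %d, %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("  max threads per block:   %d\n", prop.maxThreadsPerBlock);
    return limits;
}
```

Calling `query_launch_limits(0)` from `main` prints the values for the first GPU in the system.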
Computing a vector of length N = 1024
add<<<N,1>>>(dev_a, dev_b, dev_c);
This launches N blocks with one thread per block. Alternatively,
add<<<1,N>>>(dev_a, dev_b, dev_c);
launches one block with N threads. Either configuration works.
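To make the two launch configurations concrete, here is a minimal sketch of a host-side driver that runs either one and checks the result. The kernel names `add_by_block` and `add_by_thread`, the driver `run_add_1024`, and the test data are assumptions made for this example and are not from the original notes. The point it illustrates is that the index must come from `blockIdx.x` in the first case and from `threadIdx.x` in the second.

```cuda
#include <cuda_runtime.h>

#define N 1024

// Launched as <<<N,1>>>: one thread per block, so the block index selects the element.
__global__ void add_by_block(const int *a, const int *b, int *c) {
    int tid = blockIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}

// Launched as <<<1,N>>>: one block of N threads, so the thread index selects the element.
__global__ void add_by_thread(const int *a, const int *b, int *c) {
    int tid = threadIdx.x;
    if (tid < N) c[tid] = a[tid] + b[tid];
}

// Hypothetical driver: runs one of the two configurations and
// returns true when every c[i] equals a[i] + b[i].
bool run_add_1024(bool one_thread_per_block) {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = -i; b[i] = i * i; c[i] = 0; }

    const size_t bytes = N * sizeof(int);
    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    bool ok = cudaMalloc((void **)&dev_a, bytes) == cudaSuccess
           && cudaMalloc((void **)&dev_b, bytes) == cudaSuccess
           && cudaMalloc((void **)&dev_c, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice);
        if (one_thread_per_block) {
            add_by_block<<<N, 1>>>(dev_a, dev_b, dev_c);
        } else {
            add_by_thread<<<1, N>>>(dev_a, dev_b, dev_c);
        }
        // cudaGetLastError catches an invalid launch configuration;
        // the blocking copy back waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    for (int i = 0; ok && i < N; i++) ok = (c[i] == a[i] + b[i]);
    return ok;
}
```

Calling `run_add_1024(true)` exercises the `<<<N,1>>>` launch and `run_add_1024(false)` the `<<<1,N>>>` launch.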
Computing a vector of length N = 1024*100
In this case, threads and blocks have to be combined.
Supposing 128 threads per block,
add<<<N/128,128>>>(dev_a, dev_b, dev_c);
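is the tempting way to write the launch, but with N < 128 the integer division N/128 yields 0 and no threads are launched. More generally, whenever N is not a multiple of 128, the trailing elements are never processed.
In that case, round the number of blocks up as follows: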
add<<<(N+127)/128,128>>>(dev_a, dev_b, dev_c);
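Incidentally, the kernel for this launch is rewritten as follows, so that each thread derives a unique index from its block and thread numbers:
__global__ void add(int *a, int *b, int *c){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid < N){
        c[tid] = a[tid] + b[tid];
    }
}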
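The rounded-up block count and the bounds check in the kernel work together: rounding up may launch more threads than there are elements, and the `tid < n` test makes the surplus threads do nothing. Below is a minimal sketch of that combination. It assumes the vector length is passed as a kernel parameter `n` instead of the `N` macro so that several lengths can be tried; `blocks_for` and `run_add_rounded` are hypothetical helper names introduced for this example.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Same indexing as in the notes; the length n is a parameter (an assumption of this sketch).
__global__ void add(const int *a, const int *b, int *c, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {  // threads past the end of the vector do nothing
        c[tid] = a[tid] + b[tid];
    }
}

// Number of blocks needed to cover n elements, rounded up.
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical driver: adds two length-n vectors on the device and
// returns true when every element of the result is correct.
bool run_add_rounded(int n) {
    const int threads = 128;
    const size_t bytes = n * sizeof(int);
    std::vector<int> a(n), b(n), c(n, 0);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    bool ok = cudaMalloc((void **)&dev_a, bytes) == cudaSuccess
           && cudaMalloc((void **)&dev_b, bytes) == cudaSuccess
           && cudaMalloc((void **)&dev_c, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dev_a, a.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b.data(), bytes, cudaMemcpyHostToDevice);
        add<<<blocks_for(n, threads), threads>>>(dev_a, dev_b, dev_c, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c.data(), dev_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    for (int i = 0; ok && i < n; i++) ok = (c[i] == a[i] + b[i]);
    return ok;
}
```

With 128 threads per block, `blocks_for(100, 128)` gives 1 block (where `100/128` would give 0), and `blocks_for(1024*100, 128)` gives 800 blocks.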
Computing a vector of length N = 1024*100000
At this length the vector exceeds the maximum number of threads that can be launched at once, so each thread has to compute more than one element.
Rewrite the kernel as follows:
__global__ void add(int *a, int *b, int *c){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while(tid < N){
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;  // advance by the total number of launched threads
    }
}
With that kernel, the launch can be as simple as:
add<<<128, 128>>>(dev_a, dev_b, dev_c);
This is sufficient, because each thread keeps looping until the whole vector has been covered.
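The strided loop can be tried end to end with a fixed 128 x 128 launch. The sketch below is a minimal version under a few assumptions: the length is passed as a kernel parameter `n` rather than the `N` macro, `run_add_strided` is a hypothetical driver name, and the example lengths are much smaller than 1024*100000, since three `int` arrays of that length need about 1.2 GB of device memory.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread starts at its global index and then jumps ahead by the
// total number of launched threads until it runs past the end.
__global__ void add(const int *a, const int *b, int *c, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

// Hypothetical driver: adds two length-n vectors with a fixed
// <<<128, 128>>> launch and returns true when the result is correct.
bool run_add_strided(int n) {
    const size_t bytes = n * sizeof(int);
    std::vector<int> a(n), b(n), c(n, 0);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    bool ok = cudaMalloc((void **)&dev_a, bytes) == cudaSuccess
           && cudaMalloc((void **)&dev_b, bytes) == cudaSuccess
           && cudaMalloc((void **)&dev_c, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dev_a, a.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b.data(), bytes, cudaMemcpyHostToDevice);
        // 128 * 128 = 16384 threads, regardless of n.
        add<<<128, 128>>>(dev_a, dev_b, dev_c, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c.data(), dev_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    for (int i = 0; ok && i < n; i++) ok = (c[i] == a[i] + b[i]);
    return ok;
}
```

The same launch handles a length smaller than the thread count, such as `run_add_strided(1000)`, where most threads exit immediately, and a length far larger, such as `run_add_strided(1 << 20)`, where each thread processes 64 elements.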
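I wonder how this affects performance.
The reason for explicitly specifying the number of blocks, rather than just the total number of threads to run in parallel, is that
the GPU is internally divided into multiple Streaming Multiprocessors (SMs).
CUDA uses the block as the unit of assignment to an SM.
For example, if the number of blocks is set to 1, only one SM does any work.
So does setting the number of blocks equal to the number of SMs solve the problem? Not quite,
because the number of threads per block has an upper limit.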
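One way to see why one block per SM may not be enough is to compare the per-block thread limit with the number of threads an SM can keep resident. The sketch below is only a rough illustration under stated assumptions: it looks at thread counts alone and ignores other limits such as registers, shared memory, and the maximum number of blocks per SM. `min_blocks_to_fill_sm` and `report_sm_usage` are hypothetical helper names; the `cudaDeviceProp` fields used are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Fewest blocks per SM needed to reach the SM's resident-thread limit
// when every block uses the maximum number of threads (rounded up).
int min_blocks_to_fill_sm(int max_threads_per_sm, int max_threads_per_block) {
    return (max_threads_per_sm + max_threads_per_block - 1) / max_threads_per_block;
}

// Prints the SM count and thread limits of the given device.
// Returns false if the device query fails.
bool report_sm_usage(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    int per_sm = min_blocks_to_fill_sm(prop.maxThreadsPerMultiProcessor,
                                       prop.maxThreadsPerBlock);
    printf("SM count:                %d\n", prop.multiProcessorCount);
    printf("resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("full-size blocks needed to fill all SMs: at least %d\n",
           per_sm * prop.multiProcessorCount);
    return true;
}
```

For instance, with a hypothetical SM that holds 2048 resident threads and a 1024-thread block limit, `min_blocks_to_fill_sm(2048, 1024)` is 2, so a grid with exactly one block per SM would leave half of each SM's thread capacity unused.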