A while ago I wrote a blog post about doing a molecular dynamics calculation as my first CUDA program. Recently I happened to reread the source code of that molecular dynamics (MD) program, and it is pretty bad. Worse, if you google "CUDA programming" (in Japanese), that article comes up as the second hit. Having people use such sloppy code as a reference would be a disservice to readers, so I decided to fix it. I am not all that experienced with CUDA, but I think this version is at least better than the previous code.
Looking at it again, the code really is awful: there is a meaningless __syncthreads(), multiple threads write to the same global memory location, registers go unused, the order of operations is inefficient, and so on. It is a mess.
So first I measured per-function performance with the CUDA Visual Profiler. Incidentally, this profiler ships with the CUDA programming toolkit, and I consider it essential for anyone programming with CUDA. It is easy to use: on Windows, run cudaprof.exe, choose New... from the File menu, set the project name and location, and specify the CUDA program to run. If the program needs arguments, specify those as well. By default the program is run four times and the results are analyzed. Since you can see how much time each function took, the bottleneck is obvious at a glance.
First, as expected, the Calc function accounted for 99.8% of the total time, so I started the fixes there. The loop that iterated over both i and j was changed to loop over j only; i is already partitioned across threads and blocks, so that loop could be removed. Next, I eliminated unnecessary global memory accesses inside the j loop. For example, the global value chg[i] is now loaded into a local variable outside the loop and reused from there. Finally, I reduced expensive operations such as division wherever possible.
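The effect of those two changes, hoisting loop-invariant global reads into locals and trading repeated divisions for one reciprocal, can be sketched in plain C++ as a stand-in for the device code (the names `pair_energy`, `chg`, `dist` here are illustrative, not from the listing):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: cache values that are constant over the j-loop in locals,
// and compute one reciprocal per pair instead of dividing repeatedly --
// the same idea applied to the Calc kernel in the listing below.
float pair_energy(const std::vector<float>& chg,
                  const std::vector<float>& dist, int i)
{
    float e = 0.0f;
    float qi = chg[i];                  // read once, not once per iteration
    for (std::size_t j = 0; j < dist.size(); j++) {
        if ((int)j == i) continue;
        float r_ = 1.0f / dist[j];      // one division per pair...
        e += qi * chg[j] * r_;          // ...then only multiplications
    }
    return e;
}
```

On the GPU the payoff is larger than on the host, since every avoided `chg[i]` read is an avoided global memory transaction.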
Next, there is a function that recenters the whole system. I moved the part that computes the center coordinates into a host-side function, and kept only the kernel that shifts the whole system as a CUDA function. This step does not need to run every iteration anyway, so it is now performed once every specified number of steps; here, once every 100 steps. Its cost is therefore negligible relative to the total running time, and there is no need to go out of one's way to do all of it on the GPU.
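The host-side half of that split is just a centroid computation; a minimal sketch (the helper name `center_of` is mine, not from the listing):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Host-side sketch: compute the geometric center of num_atoms particles
// stored as x,y,z triples. Subtracting this center from every particle
// is the only part left to the small Center kernel on the GPU.
void center_of(const std::vector<float>& crd, int num_atoms, float c[3])
{
    c[0] = c[1] = c[2] = 0.0f;          // accumulators must start at zero
    for (int i = 0; i < num_atoms; i++)
        for (int j = 0; j < 3; j++)
            c[j] += crd[i*3+j];
    for (int j = 0; j < 3; j++)
        c[j] /= num_atoms;
}
```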
I also revisited the calculation of the total energy. This part was terrible: the previous code, incomprehensibly, summed the total energy inside every Calc call. So I wrote a separate function for the total-energy sum and implemented a parallel reduction there. Optimizing Parallel Reduction in CUDA covers parallel reduction in detail, and I used it as the reference for this implementation. That said, for only a few hundred particles, the device-to-host memory copy may make it faster to just do the sum on the host. For now the code switches to the GPU reduction above 1,024 particles, but a larger threshold would probably be fine.
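The core idea of the reduction can be shown on the host in a few lines: each pass folds the upper half of the array onto the lower half, so summing n values (n a power of two) takes log2(n) passes rather than n-1 sequential additions. This is only a sketch of the access pattern, not the Energy kernel itself:

```cpp
#include <cassert>
#include <vector>

// Host-side sketch of the tree reduction the Energy kernel performs.
// In CUDA, all additions within one pass run in parallel across threads;
// here they run sequentially but follow the same halving pattern.
float tree_sum(std::vector<float> v)    // v.size() must be a power of two
{
    for (std::size_t half = v.size() / 2; half > 0; half /= 2)
        for (std::size_t i = 0; i < half; i++)
            v[i] += v[i + half];
    return v[0];
}
```

This is also why the listing pads the energy array to the next power of two and zero-fills the tail.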
The measured performance is shown below. As test data I prepared a system of 4,139 particles and ran it for 100 steps. The test machine was an Intel Core2 [email protected] with an NVIDIA GeForce 8800 GTS.
The C++ program I wrote previously took 307.1 seconds. The previous CUDA program took 35.2 seconds, and this version took 6.6 seconds: 46.2x faster than the C++ version and 5.3x faster than the previous CUDA program.
Finally, here is the source code for this version. For how to compile and run it, please refer to the previous article.
simple_md_gpu.cu
////////////////////////////////////////////////////////////////
// Simple Molecular Dynamics using GPU by nox, 2009.12.10.
//
// Perform molecular dynamics for pseudo-atoms without internal
// interactions, such as bonds, angles, dihedrals. Only vdW and
// Coulomb interactions.

#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <algorithm>
#include <numeric>
#include <cmath>
#include <cutil_inline.h>

using namespace std;

const int BLOCK = 256;

__device__ float Distance(float* crd, int i, int j)
{
    float dx = crd[i*3]   - crd[j*3];
    float dy = crd[i*3+1] - crd[j*3+1];
    float dz = crd[i*3+2] - crd[j*3+2];
    return sqrtf(dx * dx + dy * dy + dz * dz);
}

__global__ void Center(float* crd, float cx, float cy, float cz, int num_atoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_atoms) return;
    crd[i*3]   -= cx;
    crd[i*3+1] -= cy;
    crd[i*3+2] -= cz;
}

// Calculate energies and forces
// Total energy = vdW + Coulomb
// vdW
//   U = eps * [(Rij / r[i])^12 - 2 * (Rij / r[i])^6]
//   F = -12 * eps / Rij * [(Rij / r[i])^13 - (Rij / r[i])^7] * r_xyz / r[i]
// Coulomb
//   U = SUM_i>j qiqj / r[i]
//   F = SUM_j qiqj / r[i]^3 * r_xyz
__global__ void Calc(float* crd, float* f, float* ene, float* chg, int num_atoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_atoms) return;

    const float eps = 0.2f;
    const float Rij = 2.5f;
    float r, rij, r12, r6, r3, r_, q, x;
    float e0, f0, f1, f2;
    e0 = f0 = f1 = f2 = 0.0f;
    // Values that are constant over the j-loop are kept in registers.
    float c0 = crd[i*3];
    float c1 = crd[i*3+1];
    float c2 = crd[i*3+2];
    float q0 = chg[i];

    for (int j = 0; j < num_atoms; j++) {
        if (i == j) continue;
        r  = Distance(crd, i, j);
        r_ = 1.0f / r;                  // one division per pair
        q  = q0 * chg[j];
        rij = Rij * r_;
        r3  = rij * rij * rij;
        r6  = r3 * r3;
        r12 = r6 * r6;
        x   = ((12 * eps) * (r12 - r6) + q * r_) * r_ * r_;
        e0 += eps * (r12 - 2.0f * r6) + q * r_;
        f0 += x * (c0 - crd[j*3]);
        f1 += x * (c1 - crd[j*3+1]);
        f2 += x * (c2 - crd[j*3+2]);
    }
    ene[i]   = e0;
    f[i*3]   = f0;
    f[i*3+1] = f1;
    f[i*3+2] = f2;
}

template<unsigned int blockSize>
__global__ void Energy(float* ene, float* oene, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;
    sdata[tid] = 0;

    while (i < n) {
        sdata[tid] += ene[i] + ene[i+blockSize];
        i += gridSize;
    }
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid+256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid+128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid+ 64]; } __syncthreads(); }
    if (tid < 32) {
        // The last warp executes in lockstep, so no __syncthreads() here.
        if (blockSize >= 64) sdata[tid] += sdata[tid+32];
        if (blockSize >= 32) sdata[tid] += sdata[tid+16];
        if (blockSize >= 16) sdata[tid] += sdata[tid+ 8];
        if (blockSize >=  8) sdata[tid] += sdata[tid+ 4];
        if (blockSize >=  4) sdata[tid] += sdata[tid+ 2];
        if (blockSize >=  2) sdata[tid] += sdata[tid+ 1];
    }
    if (tid == 0) oene[blockIdx.x] = sdata[0];
}

__global__ void Move(float* crd, float* f, int num_atoms, float cap_range)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_atoms) return;
    crd[i*3]   += f[i*3]   * 0.01f;
    crd[i*3+1] += f[i*3+1] * 0.01f;
    crd[i*3+2] += f[i*3+2] * 0.01f;
    float r = crd[i*3] * crd[i*3] + crd[i*3+1] * crd[i*3+1]
            + crd[i*3+2] * crd[i*3+2];
    float dr = r - cap_range * cap_range;
    if (dr > 0.0f) {
        // Restoring force pushing particles back inside the droplet cap.
        f[i*3]   = -crd[i*3]   / cap_range * dr * 0.01f;
        f[i*3+1] = -crd[i*3+1] / cap_range * dr * 0.01f;
        f[i*3+2] = -crd[i*3+2] / cap_range * dr * 0.01f;
    }
}

class SimpleMD {
private:
    dim3 grid;
    dim3 threads;
    int width;
    int num_steps;
    int num_atoms;
    int num_center;
    float cap_range;
    float *h_crd, *h_f, *h_ene, *h_chg;
    float *d_crd, *d_f, *d_ene, *d_chg, *d_oene;
    float tene;
public:
    SimpleMD(int n, int nc, char*);
    ~SimpleMD();
    void ReadCrd(char*);
    void PrintCrd();
    void CenterPosition();
    unsigned int NextPow2(unsigned int x);
    void GetBlocksAndThreads(int n, int& blocks, int& threads);
    void TotalEnergyReduce(int threads, int blocks);
    void TotalEnergy();
    void Run();
};

SimpleMD::SimpleMD(int n, int nc, char* fname) : num_steps(n), num_center(nc)
{
    ReadCrd(fname);
    h_f = new float[num_atoms * 3];
    fill(h_f, h_f + num_atoms * 3, 0.0f);
    // Pad the energy array to the next power of two (zero-filled tail)
    // so the reduction kernel can assume a power-of-two size.
    h_ene = new float[NextPow2(num_atoms)];
    fill(h_ene, h_ene + NextPow2(num_atoms), 0.0f);
    width = (num_atoms - 1) / BLOCK + 1;
    grid.x = width;    grid.y = 1;    grid.z = 1;
    threads.x = BLOCK; threads.y = 1; threads.z = 1;
    cudaMalloc((void**)&d_crd,  sizeof(float) * num_atoms * 3);
    cudaMalloc((void**)&d_f,    sizeof(float) * num_atoms * 3);
    cudaMalloc((void**)&d_oene, sizeof(float) * num_atoms);
    cudaMalloc((void**)&d_ene,  sizeof(float) * NextPow2(num_atoms));
    cudaMalloc((void**)&d_chg,  sizeof(float) * num_atoms);
}

SimpleMD::~SimpleMD()
{
    cudaFree(d_chg);
    cudaFree(d_ene);
    cudaFree(d_f);
    cudaFree(d_crd);
    delete[] h_ene;
    delete[] h_f;
    delete[] h_chg;
    delete[] h_crd;
}

void SimpleMD::ReadCrd(char* fname)
{
    fstream fs(fname, ios_base::in);
    string line;
    stringstream ss;
    if (!fs.is_open()) {
        cerr << "File open error: " << fname << endl;
        exit(1);
    }
    getline(fs, line);
    cout << line << endl;
    getline(fs, line);
    ss.str(line); ss >> num_atoms; ss.clear();
    getline(fs, line);
    ss.str(line); ss >> cap_range; ss.clear();
    h_crd = new float[num_atoms * 3];
    h_chg = new float[num_atoms];
    for (int i = 0; i < num_atoms; i++) {
        getline(fs, line);
        ss.str(line);
        ss >> h_crd[i*3] >> h_crd[i*3+1] >> h_crd[i*3+2] >> h_chg[i];
        ss.clear();
        ss.str("");
    }
    fs.close();
}

void SimpleMD::PrintCrd()
{
    cout << endl;
    cout << num_atoms << endl;
    cout << cap_range << endl;
    for (int i = 0; i < num_atoms; i++) {
        for (int j = 0; j < 3; j++)
            cout << " " << fixed << setw(10) << setprecision(6) << h_crd[i*3+j];
        cout << " " << fixed << setw(8) << setprecision(4) << h_chg[i];
        cout << endl;
    }
}

void SimpleMD::CenterPosition()
{
    float d[3] = {0.0f, 0.0f, 0.0f};  // accumulators must be zero-initialized
    cudaMemcpy(h_crd, d_crd, sizeof(float) * num_atoms * 3,
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < num_atoms; i++)
        for (int j = 0; j < 3; j++)
            d[j] += h_crd[i*3+j];
    for (int i = 0; i < 3; i++)
        d[i] /= num_atoms;
    cudaMemcpy(d_crd, h_crd, sizeof(float) * num_atoms * 3,
               cudaMemcpyHostToDevice);
    Center<<<grid, threads>>>(d_crd, d[0], d[1], d[2], num_atoms);
    cudaThreadSynchronize();
}

unsigned int SimpleMD::NextPow2(unsigned int x)
{
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return ++x;
}

void SimpleMD::GetBlocksAndThreads(int n, int& blocks, int& threads)
{
    threads = (n < BLOCK * 2) ? NextPow2((n + 1) / 2) : BLOCK;
    blocks = (n + (threads * 2 - 1)) / (threads * 2);
    blocks = min(width, blocks);
}

void SimpleMD::TotalEnergyReduce(int threads, int blocks)
{
    switch (threads) {
    case 512: Energy<512><<<blocks, threads, sizeof(float) * threads>>>(d_ene, d_oene, num_atoms); break;
    case 256: Energy<256><<<blocks, threads, sizeof(float) * threads>>>(d_ene, d_oene, num_atoms); break;
    case 128: Energy<128><<<blocks, threads, sizeof(float) * threads>>>(d_ene, d_oene, num_atoms); break;
    case  64: Energy< 64><<<blocks, threads, sizeof(float) * threads>>>(d_ene, d_oene, num_atoms); break;
    case  32: Energy< 32><<<blocks, threads, sizeof(float) * threads>>>(d_ene, d_oene, num_atoms); break;
    case  16: Energy< 16><<<blocks, threads, sizeof(float) * threads>>>(d_ene, d_oene, num_atoms); break;
    case   8: Energy<  8><<<blocks, threads, sizeof(float) * threads>>>(d_ene, d_oene, num_atoms); break;
    case   4: Energy<  4><<<blocks, threads, sizeof(float) * threads>>>(d_ene, d_oene, num_atoms); break;
    case   2: Energy<  2><<<blocks, threads, sizeof(float) * threads>>>(d_ene, d_oene, num_atoms); break;
    case   1: Energy<  1><<<blocks, threads, sizeof(float) * threads>>>(d_ene, d_oene, num_atoms); break;
    }
}

void SimpleMD::TotalEnergy()
{
    if (num_atoms > 1024) {
        // Large systems: reduce on the GPU.
        int th, bl;
        GetBlocksAndThreads(num_atoms, bl, th);
        int s = bl;
        while (s > 1) {
            GetBlocksAndThreads(s, bl, th);
            TotalEnergyReduce(th, bl);
            s = (s + (th * 2 - 1)) / (th * 2);
        }
        cudaThreadSynchronize();
        cudaMemcpy(&tene, d_oene, sizeof(float), cudaMemcpyDeviceToHost);
        tene /= 2.0f;
    } else {
        // Small systems: copying back and summing on the host is cheaper.
        cudaMemcpy(h_ene, d_ene, sizeof(float) * num_atoms,
                   cudaMemcpyDeviceToHost);
        tene = accumulate(h_ene, h_ene + num_atoms, 0.0f) / 2.0f;
    }
}

void SimpleMD::Run()
{
    cudaMemcpy(d_crd, h_crd, sizeof(float) * num_atoms * 3, cudaMemcpyHostToDevice);
    cudaMemcpy(d_f,   h_f,   sizeof(float) * num_atoms * 3, cudaMemcpyHostToDevice);
    cudaMemcpy(d_ene, h_ene, sizeof(float) * NextPow2(num_atoms), cudaMemcpyHostToDevice);
    cudaMemcpy(d_chg, h_chg, sizeof(float) * num_atoms, cudaMemcpyHostToDevice);
    for (int i = 0; i < num_steps; i++) {
        if (i % num_center == 0) CenterPosition();
        Calc<<<grid, threads>>>(d_crd, d_f, d_ene, d_chg, num_atoms);
        cudaThreadSynchronize();
        TotalEnergy();
        Move<<<grid, threads>>>(d_crd, d_f, num_atoms, cap_range);
        cudaThreadSynchronize();
        cout << "Energy (" << setw(7) << i + 1 << "): ";
        cout << fixed << setw(15) << setprecision(5) << tene << endl;
    }
    cudaMemcpy(h_crd, d_crd, sizeof(float) * num_atoms * 3, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_f,   d_f,   sizeof(float) * num_atoms * 3, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_ene, d_ene, sizeof(float) * num_atoms, cudaMemcpyDeviceToHost);
}

int main(int argc, char** argv)
{
    if (argc != 3) {
        cerr << "Usage: " << argv[0] << " input_file number_of_steps" << endl;
        cerr << "Input: line 1 : title" << endl;
        cerr << "       line 2 : number of atoms" << endl;
        cerr << "       line 3 : radius of droplet" << endl;
        cerr << "       line 4-: x-crd y-crd z-crd charge" << endl;
        exit(1);
    }
    stringstream ss;
    int nstep;
    cutilDeviceInit(1, argv);
    ss.str(argv[2]);
    ss >> nstep;
    int ncenter = 100;
    SimpleMD md(nstep, ncenter, argv[1]);
    md.PrintCrd();
    md.Run();
    md.PrintCrd();
    return 0;
}
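One small piece of the listing worth calling out is the NextPow2 helper, which rounds up to the next power of two by smearing the highest set bit downward and then adding one. A host-side copy (renamed `next_pow2` here) can be checked directly:

```cpp
#include <cassert>

// Host-side copy of the NextPow2 helper from the listing. Decrementing
// first means a value that is already a power of two is returned as-is;
// the shifted ORs set every bit below the highest set bit, so ++ carries
// up to the next power of two.
unsigned int next_pow2(unsigned int x)
{
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return ++x;
}
```

For the 4,139-particle test system, this pads the energy array to 8,192 entries.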
Comments
I have been looking for articles to use as references.
If it is all right with you, I would like to share this article
on my mailing list. Would that be acceptable?
Thank you for the comment, and sorry for the slow reply.
Regarding sharing this article on your mailing list: please feel free to do so.
Best regards.