How could I run GPU docker? #361

hyzhak · 2018-10-28T22:12:50Z

The question is: how can I use the Kaggle Docker image with a GPU?

I haven't found any examples of how to use the already-built Kaggle docker-python image with a GPU, so I decided to build it myself.

I cloned the current repository and built the GPU image from it (build --gpu). After that I ran the container to test whether any GPUs are visible inside it. (The same test worked for me with the official TensorFlow image tensorflow/tensorflow:latest-gpu-py3 from here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/dockerfiles)

Script:

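from tensorflow.python.client import device_lib

def get_available_gpus():
    # List every device TensorFlow can see and keep only the GPU names.
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

# Print the result so the script also works outside an interactive session.
print(get_available_gpus())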

With tensorflow/tensorflow:latest-gpu-py3 I got:

['/device:GPU:0']

But with kaggle/python-gpu-build it didn't work, and the result was:

[]

I also found these errors in the logs:

tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: UNKNOWN ERROR (-1)
tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: 24cb5b98c9ce
tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: 24cb5b98c9ce
tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 410.48.0 eug
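The two failures in this log (libcuda.so not found, and cuInit failing) can be checked without TensorFlow. The following is only a minimal sketch, not part of the original report: the function name probe_cuda_driver is made up for illustration, and it assumes the driver library is installed under its usual Linux soname libcuda.so.1.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Try to load the CUDA driver library and initialise it.

    Returns (ok, message). The default lib_name is an assumption:
    libcuda.so.1 is the usual soname of the NVIDIA driver library on Linux.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError as exc:
        # The dynamic loader could not find or load the library at all.
        return False, f"could not load {lib_name}: {exc}"

    # cuInit(0) is the CUDA driver API call that TensorFlow reported as failing.
    status = lib.cuInit(0)
    if status != 0:
        return False, f"cuInit failed with CUDA error code {status}"
    return True, "cuInit succeeded"


if __name__ == "__main__":
    print(probe_cuda_driver())
```

Running this inside the container should show whether the problem is at the driver-library level rather than in TensorFlow itself.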

Side note: I'm using nvidia-docker2 via --runtime=nvidia.

Does kaggle/python-gpu-build require extra setup before it can be run? And where can I find more information on how to use it?
Thanks!

pricebenjamin · 2018-12-18T04:38:42Z

I was running into this same issue. I believe I've found a work-around.

First, check that your container can see your GPU with nvidia-smi:

(host) $ docker run --runtime=nvidia --rm -it kaggle/python-gpu-build bash
(container) $ nvidia-smi

You may get an error:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

If this is the case, run ldconfig from inside your container. nvidia-smi should now work, but you might see CUDA Version: ERR!, as in:

Tue Dec 18 04:06:34 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: ERR!     |
|-------------------------------+----------------------+----------------------+
| [...]                         | [...]                | [...]                |
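If you want to detect this state from a script, one option is to parse the header that nvidia-smi prints. This is a rough sketch only: the function names are hypothetical, and it assumes the header keeps the "CUDA Version: ..." layout shown above.

```python
import re


def cuda_version_from_smi(output):
    """Return the 'CUDA Version' field from nvidia-smi's header, or None."""
    match = re.search(r"CUDA Version:\s*(\S+)", output)
    return match.group(1) if match else None


def cuda_version_ok(output):
    """True only if a CUDA version is present and is not the ERR! marker."""
    version = cuda_version_from_smi(output)
    return version is not None and version != "ERR!"


# Header line copied from the output above.
header = "| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: ERR!     |"
print(cuda_version_from_smi(header), cuda_version_ok(header))
```

In practice the text passed in would be the captured standard output of nvidia-smi, for example from subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout.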

The ERR! value appears to be caused by the path /usr/local/cuda/lib64/stubs being in your LD_LIBRARY_PATH environment variable. (I'm not sure why it is there; it should not be.) You can check with env | grep stubs, which should return

LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64/stubs

We simply need to remove this entry from LD_LIBRARY_PATH. I've elected to completely overwrite the variable, but this may not be desired in every scenario. Use at your own discretion.

(container) $ export LD_LIBRARY_PATH=/usr/local/cuda/lib64

nvidia-smi should no longer report an error, and TensorFlow appears to work as expected at this point.

>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
[...]
True
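The export above overwrites LD_LIBRARY_PATH entirely. A narrower alternative is to drop only the stubs entry and keep everything else. The sketch below is an illustration under assumptions, not something from the repository: strip_stub_dirs is a hypothetical helper, and it treats any entry whose last path component is "stubs" as removable.

```python
import os


def strip_stub_dirs(path_value):
    """Drop entries whose last component is 'stubs' from a colon-separated list.

    Empty entries are dropped as well.
    """
    kept = [
        entry
        for entry in path_value.split(":")
        if entry and os.path.basename(entry.rstrip("/")) != "stubs"
    ]
    return ":".join(kept)


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    # Print an export line for the shell. Changing os.environ here would not
    # help, because the dynamic loader reads LD_LIBRARY_PATH at process start.
    print(f"export LD_LIBRARY_PATH={strip_stub_dirs(current)}")
```

Saved under a name of your choice (say strip_stubs.py, a made-up filename), it could be applied in the container shell with eval "$(python strip_stubs.py)" before starting Python or nvidia-smi.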

rosbo · 2018-12-21T19:20:08Z

Thank you @pricebenjamin for sharing your detailed instructions. I linked to them from the "Running the image" section in the repo's README.

rosbo added the question label Nov 9, 2018

rosbo closed this as completed Jan 3, 2019

rosbo self-assigned this Jan 7, 2019

angerson mentioned this issue Feb 25, 2020

CUDA broken on docker image tensorflow/tensorflow:devel-gpu tensorflow/tensorflow#36974

Closed

yosuke mentioned this issue Dec 16, 2020

About the CUDA version on the scoring server hsr-project/tmc_wrs_docker#20

Closed
