DLS 2.5.0 does not recognize GPU - Linux


#1

Hi,
I have been using DLS for a while. Previously, my setup was: install the latest Nvidia driver via the package manager (sudo apt install), then install DLS. Although DLS complained about a GPU driver version mismatch, it was still able to use the GPU. For instance, DLS supported 390.48 and I had 390.77 installed.
Hoping that the supported driver would work better, I uninstalled 390.77 and installed 390.48 (via the binary downloaded by DLS).
This was with DLS version 2.1.0 (or 2.2.0, I am not sure). Then DLS was updated to 2.5.0, and now it does not run in GPU mode. The message I get is: GPU: Driver version mismatch. Running in CPU mode.

Here’s the log (first 12 lines only)

Compatible GPU driver detected
Starting redis-server: redis-server.
 * Starting nginx nginx
   ...done.
Not supported
libcuda.so.1: cannot open shared object file: No such file or directory
skipping cifar dataset download!
skipping IMDB dataset download!
skipping MNIST dataset download!
skipping reuters dataset download!
No changes detected
Using MXNet backend.

So, I have two questions:

  • What is the latest GPU driver supported by DLS 2.5.0? I don’t see that info in the DLS Manager window.
  • The log says libcuda.so.1: cannot open shared object file. Is this complaining about a missing libcuda.so within the Docker image (deepcognitionlabs/deep-learning-studio:2.5.0) or in my local installation?
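
One way to narrow down the second question is to run the same load check on the host and inside the container and compare the answers. The following is only a minimal sketch, assuming Python 3 is available in both places; the helper name can_load is made up for illustration, and the library name is the one from the log line above.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open library_name."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the shared object cannot be found or opened.
        return False


if __name__ == "__main__":
    # "libcuda.so.1" is the name reported in the DLS log.
    print("libcuda.so.1 loadable:", can_load("libcuda.so.1"))
```

If the check succeeds on the host but fails inside the container, the library is probably not being made visible to the container; if it fails in both places, the host driver installation is the more likely suspect.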

My local installation has CUDA installed (as far as I can tell from the nvcc output):

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

And nvidia-docker on my local machine can run GPU images:

$ nvidia-docker run --rm -it nvidia/opengl:base nvidia-smi
Tue Sep  4 09:45:48 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 750M     Off  | 00000000:01:00.0 N/A |                  N/A |
| N/A   62C    P0    N/A /  N/A |    422MiB /  4039MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
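
To make the "driver version mismatch" comparison concrete, here is a small sketch that extracts the driver version from the nvidia-smi header and compares it as a tuple of integers. The function name parse_driver_version is hypothetical, and 390.48 is used only because it is the version mentioned earlier in this post; the version DLS 2.5.0 actually expects is not known here.

```python
import re
from typing import Optional, Tuple


def parse_driver_version(smi_output: str) -> Optional[Tuple[int, ...]]:
    """Extract the driver version from nvidia-smi text as a tuple of ints."""
    match = re.search(r"Driver Version:\s*([0-9]+(?:\.[0-9]+)*)", smi_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Header line copied from the nvidia-smi output above.
header = "| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |"
installed = parse_driver_version(header)

print(installed)               # (390, 48)
print(installed == (390, 48))  # True
print((390, 77) > (390, 48))   # True: 390.77 is newer than 390.48
```

Comparing tuples of integers avoids the pitfalls of comparing version strings character by character.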

Finally, my system is Ubuntu 18.04.1 LTS and the kernel is 4.15.0-33-generic x86_64 GNU/Linux.

Sorry for the very long post.


#2

DLS 2.5.0 works on Ubuntu 18.04 with Nvidia driver 396.24. Here is the output from our system.

$ dpkg -l |grep nvidia-driver
ii  nvidia-driver-396                          396.24.02-0ubuntu0~gpu18.04.1       amd64        NVIDIA driver metapackage

$ nvidia-smi
Thu Sep  6 07:04:58 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.24.02              Driver Version: 396.24.02                 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:26:00.0  On |                  N/A |
| 40%   39C    P8     7W / 120W |    497MiB /  6077MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

CUDA is not required on the host system.


#3

Does it work only on 18.04? We installed on 16.04. DLS recognises the GPU, but when we train, no GPUs are shown.


#4

No, it also supports 16.04.
Are you able to see the number of GPUs available at the top right?


#5

No. In fact, I tried installing 18.04 now, but it is the same. DLS shows GPU detected, but when I go inside a project and try training, it doesn’t show the GPU.

root@blr1p01-gpu-001:~/DeepLearningStudio# nvidia-smi
Fri Nov 23 07:43:34 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0    39W / 300W |     10MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   36C    P0    42W / 300W |     10MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@blr1p01-gpu-001:~/DeepLearningStudio# dpkg -l | grep cuda
root@blr1p01-gpu-001:~/DeepLearningStudio# dpkg -l | grep nvidia
ii nvidia-docker 1.0.1-1 amd64 NVIDIA Docker container tools
root@blr1p01-gpu-001:~/DeepLearningStudio#
root@blr1p01-gpu-001:~/DeepLearningStudio# lsmod | grep nvidia
nvidia_uvm 757760 2
nvidia_drm 40960 0
nvidia_modeset 1110016 1 nvidia_drm
nvidia 14340096 77 nvidia_uvm,nvidia_modeset
ipmi_msghandler 53248 4 ipmi_devintf,ipmi_si,nvidia,ipmi_ssif
drm_kms_helper 172032 3 ast,nvidia_drm,nouveau
drm 401408 6 drm_kms_helper,ast,nvidia_drm,ttm,nouveau

Inside the Docker container:

root@b9c70e467a6c:/home/app# cd /lib
root@b9c70e467a6c:/lib# ls -l
total 28
lrwxrwxrwx 1 root root 21 Aug 27 06:44 cpp -> /etc/alternatives/cpp
drwxr-xr-x 2 root root 4096 Aug 27 06:44 ifupdown
drwxr-xr-x 2 root root 4096 Jul 26 13:50 init
drwxr-xr-x 3 root root 4096 Jul 26 13:52 lsb
drwxr-xr-x 1 root root 4096 Jul 26 13:52 systemd
drwxr-xr-x 15 root root 4096 Feb 19 2016 terminfo
drwxr-xr-x 1 root root 4096 Aug 27 06:44 udev
drwxr-xr-x 1 root root 4096 Aug 27 06:44 x86_64-linux-gnu


#6

Guys,

I think this is a serious issue. My system has both a 1G and a 10G network interface. If I install everything without enabling the 10G NIC, DLS shows 2 GPUs. If I enable the 10G NIC and then install DLS with only the 10G NIC functional, DLS doesn’t show the GPUs.

Also, DLS stops showing the GPUs after I reboot the system. I even disabled the nouveau module and enabled only the nvidia drivers. Nothing so far makes DLS see the GPUs consistently across reboots. I verified the same issue on AWS as well.
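
One thing that could be compared before and after a reboot is which NVIDIA device nodes exist under /dev on the host. This is a sketch only: the function name nvidia_device_nodes is made up, and whether missing device nodes are related to the behaviour described here is an assumption, not something confirmed in this thread. The node names nvidiactl and nvidia-uvm are the ones the NVIDIA Linux driver normally creates.

```python
import os
from typing import List


def nvidia_device_nodes(dev_dir: str = "/dev") -> List[str]:
    """Return the sorted entries in dev_dir whose names start with 'nvidia'."""
    try:
        names = os.listdir(dev_dir)
    except OSError:
        # Directory missing or unreadable: report no nodes.
        return []
    return sorted(name for name in names if name.startswith("nvidia"))


if __name__ == "__main__":
    nodes = nvidia_device_nodes()
    print("device nodes:", nodes)
    for expected in ("nvidiactl", "nvidia-uvm"):
        print(expected, "present:", expected in nodes)
```

Running it once when DLS shows the GPUs and once when it does not would show whether the host-side device nodes differ between the two states.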


#7

Logs when the GPU was not visible in DLS:

Starting Deep Learning Studio…Starting redis-server: redis-server.

 * Starting nginx nginx
   ...done.
    Detected CUDA path
    Checking GPU support…GPU supported
    [13:46:08] downloading CIFAR dataset…
    [13:46:41] Downloaded CIFAR dataset.
    ./cifar-10/
    [13:46:41] downloading IMDB dataset…
    [13:46:45] Downloaded IMDB dataset.
    ./imdb/
    initiated datasets repo at: /root/.pydataset/
    Generated dataset: titanic
    ./
    Generated dataset: iris
    ./
    Pydatasets installed
    [13:46:47] downloading MNIST dataset…
    [13:47:02] Downloaded MNIST dataset.
    ./mnist/
    [13:47:02] downloading reuters dataset…
    [13:47:04] Downloaded reuters dataset.
    ./reuters/
    No changes detected
    Using MXNet backend.
    /usr/local/lib/python3.5/dist-packages/allauth/account/templatetags/account_tags.py:4: DeprecationWarning: {% load account_tags %} is deprecated, use {% load account %}
    DeprecationWarning)
    /usr/local/lib/python3.5/dist-packages/allauth/socialaccount/templatetags/socialaccount_tags.py:4: DeprecationWarning: {% load socialaccount_tags %} is deprecated, use {% load socialaccount %}
    " {% load socialaccount %}", DeprecationWarning)
    Operations to perform:
    Apply all migrations: account, admin, auth, authtoken, automl, contenttypes, environments, project, projects, reversion, sessions, sites, socialaccount
    Running migrations:
    Applying contenttypes.0001_initial… OK
    Applying auth.0001_initial… OK
    Applying account.0001_initial… OK
    Applying account.0002_email_max_length… OK
    Applying admin.0001_initial… OK
    Applying admin.0002_logentry_remove_auto_add… OK
    Applying contenttypes.0002_remove_content_type_name… OK
    Applying auth.0002_alter_permission_name_max_length… OK
    Applying auth.0003_alter_user_email_max_length… OK
    Applying auth.0004_alter_user_username_opts… OK
    Applying auth.0005_alter_user_last_login_null… OK
    Applying auth.0006_require_contenttypes_0002… OK
    Applying auth.0007_alter_validators_add_error_messages… OK
    Applying auth.0008_alter_user_username_max_length… OK
    Applying authtoken.0001_initial… OK
    Applying authtoken.0002_auto_20160226_1747… OK
    Applying automl.0001_initial… OK
    Applying environments.0001_initial… OK
    Applying project.0001_initial… OK
    Applying projects.0001_initial… OK
    Applying reversion.0001_squashed_0004_auto_20160611_1202… OK
    Applying sessions.0001_initial… OK
    Applying sites.0001_initial… OK
    Applying sites.0002_alter_domain_unique… OK
    Applying socialaccount.0001_initial… OK
    Applying socialaccount.0002_token_max_lengths… OK
    Applying socialaccount.0003_extra_data_default_dict… OK
    Using MXNet backend.
    /usr/local/lib/python3.5/dist-packages/allauth/account/templatetags/account_tags.py:4: DeprecationWarning: {% load account_tags %} is deprecated, use {% load account %}
    DeprecationWarning)
    /usr/local/lib/python3.5/dist-packages/allauth/socialaccount/templatetags/socialaccount_tags.py:4: DeprecationWarning: {% load socialaccount_tags %} is deprecated, use {% load socialaccount %}
    " {% load socialaccount %}", DeprecationWarning)
    loading initial db
    Using MXNet backend.
    /usr/local/lib/python3.5/dist-packages/allauth/account/templatetags/account_tags.py:4: DeprecationWarning: {% load account_tags %} is deprecated, use {% load account %}
    DeprecationWarning)
    /usr/local/lib/python3.5/dist-packages/allauth/socialaccount/templatetags/socialaccount_tags.py:4: DeprecationWarning: {% load socialaccount_tags %} is deprecated, use {% load socialaccount %}
    " {% load socialaccount %}", DeprecationWarning)
    Installed 2 object(s) from 1 fixture(s)
    [2018-11-23 13:47:12 +0000] [246] [INFO] Starting gunicorn 19.6.0
    [2018-11-23 13:47:12 +0000] [246] [INFO] Listening at: http://127.0.0.1:8000 (246)
    [2018-11-23 13:47:12 +0000] [246] [INFO] Using worker: threads
    [2018-11-23 13:47:12 +0000] [263] [INFO] Booting worker with pid: 263
    [I 13:47:12.701 LabApp] Writing notebook server cookie secret to /data/1/.local/share/jupyter/runtime/notebook_cookie_secret
    [I 13:47:12.932 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.5/dist-packages/jupyterlab
    [I 13:47:12.932 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
    [W 13:47:12.936 LabApp] JupyterLab server extension not enabled, manually loading…
    [I 13:47:12.937 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.5/dist-packages/jupyterlab
    [I 13:47:12.937 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
    [I 13:47:12.939 LabApp] Serving notebooks from local directory: /data/1
    [I 13:47:12.940 LabApp] The Jupyter Notebook is running at:
    [I 13:47:12.940 LabApp] http://(73b5c3b963c6 or 127.0.0.1):8888/?token=…
    [I 13:47:12.940 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [W 13:47:12.940 LabApp] No web browser found: could not locate runnable browser.
    Using MXNet backend.
 * Running on http://127.0.0.1:6666/ (Press CTRL+C to quit)

#8

Log when the GPU was visible in DLS:

Starting Deep Learning Studio…Starting redis-server: redis-server.

 * Starting nginx nginx
   ...done.
    Detected CUDA path
    Checking GPU support…GPU supported
    [17:13:40] downloading CIFAR dataset…
    [17:18:14] Downloaded CIFAR dataset.
    ./cifar-10/
    [17:18:15] downloading IMDB dataset…
    [17:18:25] Downloaded IMDB dataset.
    ./imdb/
    initiated datasets repo at: /root/.pydataset/
    Generated dataset: titanic
    ./
    Generated dataset: iris
    ./
    Pydatasets installed
    [17:18:27] downloading MNIST dataset…
    [17:18:55] Downloaded MNIST dataset.
    ./mnist/
    [17:18:55] downloading reuters dataset…
    [17:19:00] Downloaded reuters dataset.
    ./reuters/
    No changes detected
    Using MXNet backend.
    /usr/local/lib/python3.5/dist-packages/allauth/account/templatetags/account_tags.py:4: DeprecationWarning: {% load account_tags %} is deprecated, use {% load account %}
    DeprecationWarning)
    /usr/local/lib/python3.5/dist-packages/allauth/socialaccount/templatetags/socialaccount_tags.py:4: DeprecationWarning: {% load socialaccount_tags %} is deprecated, use {% load socialaccount %}
    " {% load socialaccount %}", DeprecationWarning)
    Operations to perform:
    Apply all migrations: account, admin, auth, authtoken, automl, contenttypes, environments, project, projects, reversion, sessions, sites, socialaccount
    Running migrations:
    Applying contenttypes.0001_initial… OK
    Applying auth.0001_initial… OK
    Applying account.0001_initial… OK
    Applying account.0002_email_max_length… OK
    Applying admin.0001_initial… OK
    Applying admin.0002_logentry_remove_auto_add… OK
    Applying contenttypes.0002_remove_content_type_name… OK
    Applying auth.0002_alter_permission_name_max_length… OK
    Applying auth.0003_alter_user_email_max_length… OK
    Applying auth.0004_alter_user_username_opts… OK
    Applying auth.0005_alter_user_last_login_null… OK
    Applying auth.0006_require_contenttypes_0002… OK
    Applying auth.0007_alter_validators_add_error_messages… OK
    Applying auth.0008_alter_user_username_max_length… OK
    Applying authtoken.0001_initial… OK
    Applying authtoken.0002_auto_20160226_1747… OK
    Applying automl.0001_initial… OK
    Applying environments.0001_initial… OK
    Applying project.0001_initial… OK
    Applying projects.0001_initial… OK
    Applying reversion.0001_squashed_0004_auto_20160611_1202… OK
    Applying sessions.0001_initial… OK
    Applying sites.0001_initial… OK
    Applying sites.0002_alter_domain_unique… OK
    Applying socialaccount.0001_initial… OK
    Applying socialaccount.0002_token_max_lengths… OK
    Applying socialaccount.0003_extra_data_default_dict…Using MXNet backend.
    /usr/local/lib/python3.5/dist-packages/allauth/account/templatetags/account_tags.py:4: DeprecationWarning: {% load account_tags %} is deprecated, use {% load account %}
    DeprecationWarning)
    /usr/local/lib/python3.5/dist-packages/allauth/socialaccount/templatetags/socialaccount_tags.py:4: DeprecationWarning: {% load socialaccount_tags %} is deprecated, use {% load socialaccount %}
    " {% load socialaccount %}", DeprecationWarning)
    OK
    loading initial db
    Installed 2 object(s) from 1 fixture(s)
    Using MXNet backend.
    /usr/local/lib/python3.5/dist-packages/allauth/account/templatetags/account_tags.py:4: DeprecationWarning: {% load account_tags %} is deprecated, use {% load account %}
    DeprecationWarning)
    /usr/local/lib/python3.5/dist-packages/allauth/socialaccount/templatetags/socialaccount_tags.py:4: DeprecationWarning: {% load socialaccount_tags %} is deprecated, use {% load socialaccount %}
    " {% load socialaccount %}", DeprecationWarning)
    [2018-11-23 17:19:07 +0000] [246] [INFO] Starting gunicorn 19.6.0
    [2018-11-23 17:19:07 +0000] [246] [INFO] Listening at: http://127.0.0.1:8000 (246)
    [2018-11-23 17:19:07 +0000] [246] [INFO] Using worker: threads
    [2018-11-23 17:19:07 +0000] [263] [INFO] Booting worker with pid: 263
    [I 17:19:07.906 LabApp] Writing notebook server cookie secret to /data/1/.local/share/jupyter/runtime/notebook_cookie_secret
    [I 17:19:08.204 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.5/dist-packages/jupyterlab
    [I 17:19:08.204 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
    [W 17:19:08.208 LabApp] JupyterLab server extension not enabled, manually loading…
    [I 17:19:08.209 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.5/dist-packages/jupyterlab
    [I 17:19:08.209 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
    [I 17:19:08.212 LabApp] Serving notebooks from local directory: /data/1
    [I 17:19:08.212 LabApp] The Jupyter Notebook is running at:
    [I 17:19:08.212 LabApp] http://(b7f91e67de6d or 127.0.0.1):8888/?token=…
    [I 17:19:08.212 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [W 17:19:08.213 LabApp] No web browser found: could not locate runnable browser.
    Using MXNet backend.
 * Running on http://127.0.0.1:6666/ (Press CTRL+C to quit)
    /usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
    “This module will be removed in 0.20.”, DeprecationWarning)
    Using MXNet backend.
    starting training process
    /usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
    “This module will be removed in 0.20.”, DeprecationWarning)
    Using MXNet backend.
    starting training process
    /usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
    “This module will be removed in 0.20.”, DeprecationWarning)
    Using MXNet backend.
    starting training process
    /usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
    “This module will be removed in 0.20.”, DeprecationWarning)
    Using MXNet backend.
    starting training process
    /usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
    “This module will be removed in 0.20.”, DeprecationWarning)
    Using MXNet backend.
    starting training process
    /usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
    “This module will be removed in 0.20.”, DeprecationWarning)
    Using MXNet backend.
    starting training process
    /usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
    “This module will be removed in 0.20.”, DeprecationWarning)
    Using MXNet backend.
    starting training process
    [17:27:19] src/operator/././cudnn_algoreg-inl.h:106: Running performance tests to find the best convolution algorithm, this can take a while… (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
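
Since the two logs are long and hard to compare by eye, a small script can mask the timestamps and list only the lines that differ. This is a rough sketch, not a DLS tool: the names normalize and log_diff are hypothetical, and the timestamp pattern is a guess based on the bracketed formats that appear in these logs.

```python
import difflib
import re
from typing import Iterable, List

# Matches bracketed fields that contain an HH:MM:SS time, e.g. "[13:46:08]"
# or "[2018-11-23 13:47:12 +0000]"; fields such as "[246]" are left alone.
_TIMESTAMP = re.compile(r"\[[^\]]*\d{2}:\d{2}:\d{2}[^\]]*\]")


def normalize(lines: Iterable[str]) -> List[str]:
    """Strip whitespace, drop empty lines and mask timestamps."""
    return [_TIMESTAMP.sub("[TIME]", line.strip()) for line in lines if line.strip()]


def log_diff(first: Iterable[str], second: Iterable[str]) -> List[str]:
    """Return lines only in first ('- ') or only in second ('+ ')."""
    a, b = normalize(first), normalize(second)
    changes: List[str] = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            changes.extend("- " + line for line in a[i1:i2])
        if tag in ("insert", "replace"):
            changes.extend("+ " + line for line in b[j1:j2])
    return changes


not_visible = ["[13:46:08] downloading CIFAR dataset…", "Using MXNet backend."]
visible = ["[17:13:40] downloading CIFAR dataset…", "Using MXNet backend.",
           "starting training process"]
print(log_diff(not_visible, visible))  # ['+ starting training process']
```

Lines that differ only in content such as container hostnames will still show up, so the result needs a quick manual read.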

#9

Could there be a case where some environment variables are missing, so that Docker cannot recognise the underlying Nvidia driver?

Also, can you explain how DLS is affected when we change the NIC?
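
For the environment variable question, a quick check is to print the GPU-related variables in both the working and the non-working state and compare them. The sketch below is an assumption-laden starting point: the function name gpu_env_report and the list of variables are chosen for illustration, and nothing in this thread confirms which variables, if any, DLS reads. CUDA_VISIBLE_DEVICES is honoured by the CUDA runtime, NVIDIA_VISIBLE_DEVICES by the NVIDIA container runtime, and LD_LIBRARY_PATH by the dynamic loader.

```python
import os
from typing import Dict, Mapping, Optional, Sequence

# Variables picked for illustration; the list is an assumption.
GPU_RELATED_VARS = ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES", "LD_LIBRARY_PATH")


def gpu_env_report(
    env: Mapping[str, str],
    names: Sequence[str] = GPU_RELATED_VARS,
) -> Dict[str, Optional[str]]:
    """Map each variable name to its value, or None if it is unset."""
    return {name: env.get(name) for name in names}


if __name__ == "__main__":
    for name, value in gpu_env_report(os.environ).items():
        print(f"{name}={'<unset>' if value is None else value}")
```

Running this inside the container in both states would show whether the environment differs at all between them.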