  • tf-nightly version = 2.12.0-dev20230203
  • Python version = 3.10.6
  • CUDA drivers version = 525.85.12
  • CUDA version = 12.0
  • Cudnn version = 8.5.0
  • I am using Linux (x86_64, Ubuntu 22.04)
  • I am coding in Visual Studio Code in a venv virtual environment

I am trying to run some models on the GPU (an NVIDIA GeForce RTX 3050) using tf-nightly 2.12 (so that I can use CUDA 12.0). The problem is that every check I make appears to be correct, yet in the end the script is unable to detect the GPU. I've spent a lot of time trying to work out what is happening and nothing seems to help, so any advice or solution would be more than welcome. The GPU does work with torch, as you can see at the very end of the question.

Below are the most relevant CUDA-related checks I did (in the Visual Studio Code terminal); I hope you find them useful:

  1. Check CUDA version:

    $ nvcc --version

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2023 NVIDIA Corporation
    Built on Fri_Jan__6_16:45:21_PST_2023
    Cuda compilation tools, release 12.0, V12.0.140
    Build cuda_12.0.r12.0/compiler.32267302_0
    
  2. Check that the CUDA libraries are on the dynamic linker path:

    $ echo $LD_LIBRARY_PATH

    /usr/cuda/lib
    
  3. Check the NVIDIA drivers for the GPU and that the GPU is visible from the venv:

    $ nvidia-smi

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
    | N/A   40C    P5     6W /  20W |     46MiB /  4096MiB |     22%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      1356      G   /usr/lib/xorg/Xorg                 45MiB |
    +-----------------------------------------------------------------------------+
    
  4. Add cuda/bin to PATH and check it (a note on making this persistent follows the list):

    $ export PATH="/usr/local/cuda/bin:$PATH"

    $ echo $PATH

    /usr/local/cuda-12.0/bin:/home/victus-linux/Escritorio/MasterThesis_CODE/to_share/venv_master/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin
    
  5. Custom function to check if CUDA is correctly installed: [function by Sherlock]

    # list the libraries visible to the dynamic linker (including those on LD_LIBRARY_PATH) and grep for $1
    function lib_installed() { /sbin/ldconfig -N -v $(sed 's/:/ /' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep $1; }
    function check() { lib_installed $1 && echo "$1 is installed" || echo "ERROR: $1 is NOT installed"; }
    check libcuda
    check libcudart
    
    libcudart.so.12 -> libcudart.so.12.0.146
            libcuda.so.1 -> libcuda.so.525.85.12
            libcuda.so.1 -> libcuda.so.525.85.12
            libcudadebugger.so.1 -> libcudadebugger.so.525.85.12
    libcuda is installed
            libcudart.so.12 -> libcudart.so.12.0.146
    libcudart is installed
    
  6. Custom function to check if Cudnn is correctly installed: [function by Sherlock]

    function lib_installed() { /sbin/ldconfig -N -v $(sed 's/:/ /' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep $1; }
    function check() { lib_installed $1 && echo "$1 is installed" || echo "ERROR: $1 is NOT installed"; }
    check libcudnn 
    
            libcudnn_cnn_train.so.8 -> libcudnn_cnn_train.so.8.8.0
            libcudnn_cnn_infer.so.8 -> libcudnn_cnn_infer.so.8.8.0
            libcudnn_adv_train.so.8 -> libcudnn_adv_train.so.8.8.0
            libcudnn.so.8 -> libcudnn.so.8.8.0
            libcudnn_ops_train.so.8 -> libcudnn_ops_train.so.8.8.0
            libcudnn_adv_infer.so.8 -> libcudnn_adv_infer.so.8.8.0
            libcudnn_ops_infer.so.8 -> libcudnn_ops_infer.so.8.8.0
    libcudnn is installed
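
Side note on step 4: export only affects the current shell session, so to keep these settings across sessions I would append them to ~/.bashrc, roughly as below (/usr/local/cuda/lib64 is the conventional library location; adjust it to your actual install):

    $ echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
    $ echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
    $ source ~/.bashrc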
    

So, once I had done all these checks, I ran a script to test whether everything was finally OK, and the following error appeared:

import tensorflow as tf

print(f'\nTensorflow version = {tf.__version__}\n')
print(f'\n{tf.config.list_physical_devices("GPU")}\n')

2023-03-02 12:05:09.463343: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.489911: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.490522: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-02 12:05:10.066759: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Tensorflow version = 2.12.0-dev20230203

2023-03-02 12:05:10.748675: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-03-02 12:05:10.771263: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

[]

Extra check: I ran a similar check script with torch, and there it worked, so I guess the problem is specific to tensorflow/tf-nightly:

import torch

print(f'\nAvailable cuda = {torch.cuda.is_available()}')

print(f'\nGPUs available = {torch.cuda.device_count()}')

print(f'\nCurrent device = {torch.cuda.current_device()}')

print(f'\nCurrent Device location = {torch.cuda.device(0)}')

print(f'\nName of the device = {torch.cuda.get_device_name(0)}')

Available cuda = True

GPUs available = 1

Current device = 0

Current Device location = <torch.cuda.device object at 0x7fbe26fd2ec0>

Name of the device = NVIDIA GeForce RTX 3050 Laptop GPU

Please, if you know anything that might help solve this issue, don't hesitate to tell me.

JaimeCorton
  • Hmm, note that pip3 install torch brings a lot of cuda 11 packages. – arivero Mar 15 '23 at 21:07
  • tf.sysconfig.get_build_info() shows cuda 11, does it? My guess is that there is no wheel shipped with cuda 12 yet. – arivero Mar 15 '23 at 22:05
  • @arivero This is the output of tf.sysconfig.get_build_info(): OrderedDict([('cpu_compiler', '/dt9/usr/bin/gcc'), ('cuda_compute_capabilities', ['sm_35', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'compute_80']), ('cuda_version', '11.8'), ('cudnn_version', '8'), ('is_cuda_build', True), ('is_rocm_build', False), ('is_tensorrt_build', True)]). Cuda_version is 11.8, as you mentioned. What I don't get is how that is possible, given that this tf-nightly version was supposed to be compatible with Cuda 12. – JaimeCorton Mar 17 '23 at 09:39
  • Yes, I see the problem; because of that I have bountied the question, in the hope that someone with knowledge can tell us whether tf-nightly can choose between 11 and 12 automagically or not. – arivero Mar 17 '23 at 10:55
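
For anyone who wants to reproduce the check discussed above, a one-liner along these lines (run inside the venv with tf-nightly installed) prints the CUDA version the wheel was built against; per JaimeCorton's comment it reports 11.8 here:

    $ python3 -c "import tensorflow as tf; print(tf.sysconfig.get_build_info()['cuda_version'])"
    11.8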

3 Answers


I think that, as of March 2023, the only tensorflow distribution for cuda 12 is the docker package from NVIDIA.

A tf package built for cuda 12 should show the following info:

>>> tf.sysconfig.get_build_info() 
OrderedDict([('cpu_compiler', '/usr/bin/x86_64-linux-gnu-gcc-11'), 
('cuda_compute_capabilities', ['compute_86']), 
('cuda_version', '12.0'), ('cudnn_version', '8'), 
('is_cuda_build', True), ('is_rocm_build', False), ('is_tensorrt_build', True)])

But if we run tf.sysconfig.get_build_info() on any tensorflow package installed via pip, it still reports that cuda_version is 11.x.

So your alternatives are:

  • install docker following the nvidia cloud instructions and run one of the recent containers (a sketch follows this list)
  • compile tensorflow from source, either nightly or the last release. Caveat: it takes a lot of RAM and some time, as all good compilations do, plus the occasional error to fix along the way; in my case I had to define kFP8, the new 8-bit float.
  • wait
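
For the docker route, a minimal sketch; the image tag below is only an example (check the NVIDIA NGC catalog for current ones), and it assumes the NVIDIA container toolkit is already set up:

    $ docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:23.02-tf2-py3
    # inside the container:
    $ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"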
arivero
  • Note I am not very sure about how forward compatibility works. For instance, I have a build that claims cuda 11.2 and it is working on a mixed install of 11.8 and 12.0. – arivero Apr 14 '23 at 08:50
  • A month later, and I could now switch from pip to the arch repository versions by first uninstalling the pip tensorflow packages and then running `sudo pacman -S tensorflow-cuda python-tensorflow-cuda`. This gives me `'cuda_version', '12.1'` and fixes the `'libcudart.so.11.0'` loading error. – Joel Sjögren Apr 16 '23 at 16:11
  • Another option is to install cuda-11-8 from https://developer.nvidia.com/cuda-11-8-0-download-archive (for example). If you use WSL2, the CUDA installation on Windows needs to match the one in the Linux distro you are running under WSL2. It worked for me with less headache than compiling TensorFlow or using docker. – George Aug 08 '23 at 19:08

"I experienced the same thing, and it can be resolved by installing TensorFlowRT."

  1. pip3 install nvidia-tensorrt
  2. check the libnvinfer.* file links once again, and make sure that LD_LIBRARY_PATH points to the installation directory (a sketch follows this list)
  3. refer: Could not load dynamic library 'libnvinfer.so.7'
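
A hedged sketch of steps 1-2; the exact directory where the wheel puts libnvinfer varies with the wheel version (newer wheels ship the libraries in a separate tensorrt_libs package), so locate the files first rather than trusting the example path:

    $ pip3 install nvidia-tensorrt
    # find where the wheel placed the libnvinfer libraries
    $ find venv_master/lib -name "libnvinfer*" 2>/dev/null
    # then point the dynamic loader at that directory (path is illustrative)
    $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:venv_master/lib/python3.10/site-packages/tensorrt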

Once all the libraries are fixed, the GPU becomes visible:

[screenshot: GPU listed in the device output]

num3ri
  • TensorRT was not the problem in my case. It seems that @arivero's answer points in the right direction. – JaimeCorton Mar 17 '23 at 09:45
  • I think the same. Still, 50 points awarded because it could be useful for others – arivero Mar 22 '23 at 20:15
  • Not useful, it doesn't solve the problem. Can you please elaborate on how you were able to "bypass" the requirement for a different cuda version? – luisgrisolia Apr 22 '23 at 15:11
  • This might work on older graphics cards with Ubuntu Linux. I have a GTX 1050 Ti. The install command lagged for a minute and then finished. I restarted the terminal and was able to list my GPU device. – Joachim Rives Aug 22 '23 at 04:49

Just to add another alternative: the official pip3 install tensorflow also doesn't work with CUDA 12, so my solution was to go back to CUDA 11:

sudo apt install cuda-11-8

Tensorflow now works.
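
A quick check after the downgrade; this should list the GPU if the runtime libraries now resolve:

    $ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
    [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]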

Ken Y-N
  • Sorry Ken, voting down because I was clear enough in the initial question about what you say here: from the very beginning I was using a tf-nightly version precisely so that I could use CUDA 12 (because at that moment tensorflow did not otherwise support CUDA 12). Maybe your answer makes sense on another question. Have a great day. – JaimeCorton May 02 '23 at 08:58