DeepFaceLab_Linux unworkable on modern cards (e.g. 4090) or modern distros

I've spent a few days so far working this out.

For a 4090 I believe I need a 12.x version of cudatoolkit: 12.0, 12.1 or 12.2.

That said, I understand that earlier cudatoolkit packages inside conda, such as 11.8, are compatible with the 12.x series, and that it's only the system driver that needs to be 12.x. I think.

Unconventional perhaps, but I intend this thread to be a list of things we've tried, and their results so that others can watch along and contribute if they wish.

Jul 18 '23 09:07 marshalleq

My testing methodology is simple: use the supported and matching versions of Python, cudatoolkit, cudnn and conda. As I understand it, each has to match the others; however, I am not yet clear on conda, as I have read it both ways.

Getting my head around compatibility

Up-to-date compatibility matrix for tensorflow can be found here.
Up-to-date compatibility matrix for cudnn can be found here.
Official Nvidia cuda compatibility page here.
CUDA toolkit release notes, including required driver versions for CUDA, here.

Key notes from the official Nvidia driver page: From CUDA 11 onwards, applications compiled with a CUDA Toolkit release from within a CUDA major release family can run, with limited feature-set, on systems having at least the minimum required driver version as indicated below. This minimum required driver can be different from the driver packaged with the CUDA Toolkit but should belong to the same major release.

Also: _As described, applications that directly rely only on the CUDA runtime can be deployed in the following two scenarios:

CUDA driver that’s installed on the system is newer than the runtime.
CUDA runtime is newer than the CUDA driver on the system but they are from the same major release of CUDA Toolkit._

Minor version compatibility has another benefit that offers flexibility in the use and deployment of libraries. Applications that use libraries that support minor version compatibility can be deployed on systems with a different version of the toolkit and libraries without recompiling the application for the difference in the library version. This holds true for both older and newer versions of the libraries provided they are all from the same major release family. Note that libraries themselves have interdependencies that should be considered. For example, each cuDNN version requires a certain version of cuBLAS.

If an application is unable to leverage the minor version compatibility due to any of the aforementioned reasons, then the Forward Compatibility model can be used as an alternative even though Forward Compatibility is mainly intended for compatibility across major toolkit versions.

OK, so forward compatibility lets you run applications built with a newer toolkit on an older driver. Not really my situation, so I will ignore it for now.
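To check my own reading of the two deployment scenarios quoted above, here is a minimal Python sketch. It is my simplification, not Nvidia's rule set: the function names are made up, version strings are assumed to look like "11.8", and it ignores both the minimum required driver version and the limited feature-set caveat from the quote.

```python
def parse_version(text):
    """Turn a version string such as '11.8' into the tuple (11, 8)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def runtime_is_deployable(driver_cuda, runtime_cuda):
    """Apply the two quoted scenarios (hypothetical helper, simplified)."""
    driver = parse_version(driver_cuda)
    runtime = parse_version(runtime_cuda)
    # Scenario 1: the driver's CUDA version is at least as new as the runtime.
    if driver >= runtime:
        return True
    # Scenario 2: the runtime is newer, but both share a major release.
    # The quote limits this to CUDA 11 onwards.
    return driver[0] == runtime[0] and runtime[0] >= 11


if __name__ == "__main__":
    # A 12.2 driver with an 11.8 cudatoolkit in conda (scenario 1).
    print(runtime_is_deployable("12.2", "11.8"))
    # An 11.2 driver with an 11.8 cudatoolkit (scenario 2).
    print(runtime_is_deployable("11.2", "11.8"))
    # An 11.8 driver with a 12.1 cudatoolkit (different major release).
    print(runtime_is_deployable("11.8", "12.1"))
```

If this reading is right, an 11.8 cudatoolkit inside conda under a 12.x system driver falls under scenario 1.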

Jul 19 '23 06:07 marshalleq

So the default installation is created as follows:

conda create -n deepfacelab -c main python=3.7 cudnn=7.6.5 cudatoolkit=10.1.243

Which on my system (Ubuntu 22.04) installs the following packages:

_libgcc_mutex      main/linux-64::_libgcc_mutex-0.1-main
_openmp_mutex      main/linux-64::_openmp_mutex-5.1-1_gnu
ca-certificates    main/linux-64::ca-certificates-2023.05.30-h06a4308_0
certifi            main/linux-64::certifi-2022.12.7-py37h06a4308_0
cudatoolkit        main/linux-64::cudatoolkit-10.1.243-h6bb024c_0
cudnn              main/linux-64::cudnn-7.6.5-cuda10.1_0
ld_impl_linux-64   main/linux-64::ld_impl_linux-64-2.38-h1181459_1
libffi             main/linux-64::libffi-3.4.4-h6a678d5_0
libgcc-ng          main/linux-64::libgcc-ng-11.2.0-h1234567_1
libgomp            main/linux-64::libgomp-11.2.0-h1234567_1
libstdcxx-ng       main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
ncurses            main/linux-64::ncurses-6.4-h6a678d5_0
openssl            main/linux-64::openssl-1.1.1u-h7f8727e_0
pip                main/linux-64::pip-22.3.1-py37h06a4308_0
python             main/linux-64::python-3.7.16-h7a1cb2a_0
readline           main/linux-64::readline-8.2-h5eee18b_0
setuptools         main/linux-64::setuptools-65.6.3-py37h06a4308_0
sqlite             main/linux-64::sqlite-3.41.2-h5eee18b_0
tk                 main/linux-64::tk-8.6.12-h1ccaba5_0
wheel              main/linux-64::wheel-0.38.4-py37h06a4308_0
xz                 main/linux-64::xz-5.4.2-h5eee18b_0
zlib               main/linux-64::zlib-1.2.13-h5eee18b_0

And as per requirements-cuda.txt:

tqdm
numpy==1.19.3
numexpr
h5py==2.10.0
opencv-python==4.1.0.25
ffmpeg-python==0.1.17
scikit-image==0.14.2
scipy==1.4.1
colorama
tensorflow-gpu==2.4.0
pyqt5

This will install the following versions: MarkupSafe-2.1.3 PyQt5-Qt5-5.15.2 PyQt5-sip-12.12.1 PyWavelets-1.3.0 absl-py-0.15.0 astunparse-1.6.3 cachetools-5.3.1 charset-normalizer-3.2.0 cloudpickle-2.2.1 colorama-0.4.6 cycler-0.11.0 dask-2022.2.0 ffmpeg-python-0.1.17 flatbuffers-1.12 fonttools-4.38.0 fsspec-2023.1.0 future-0.18.3 gast-0.3.3 google-auth-2.22.0 google-auth-oauthlib-0.4.6 google-pasta-0.2.0 grpcio-1.32.0 h5py-2.10.0 idna-3.4 importlib-metadata-6.7.0 keras-preprocessing-1.1.2 kiwisolver-1.4.4 locket-1.0.0 markdown-3.4.3 matplotlib-3.5.3 networkx-2.6.3 numexpr-2.8.4 numpy-1.19.3 oauthlib-3.2.2 onnx-1.14.0 opencv-python-4.1.0.25 opt-einsum-3.3.0 packaging-23.1 partd-1.4.0 pillow-9.5.0 protobuf-3.20.3 pyasn1-0.5.0 pyasn1-modules-0.3.0 pyparsing-3.1.0 pyqt5-5.15.9 python-dateutil-2.8.2 pyyaml-6.0. requests-2.31.0 requests-oauthlib-1.3. rsa-4.9 scikit-image-0.14.2 scipy-1.4.1 six-1.15.0 tensorboard-2.11.2 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorflow-estimator-2.4.0 tensorflow-gpu-2.4.0 termcolor-1.1.0 tf2onnx-1.9.3 toolz-0.12.0 tqdm-4.65.0 typing-extensions-3.7.4.3 urllib3-1.26.16 werkzeug-2.2.3 wrapt-1.12.1 zipp-3.15.0

Jul 19 '23 07:07 marshalleq

On my system I have python=3.10, cudnn=8.9.3 and cudatoolkit=12.2, plus a golden Nvidia 4090 that I'm even jealous of myself for having :)

According to the support matrix (as I'm reading it anyway), the Hopper hardware architecture requires CUDA 11.8. I'm assuming this refers to the system driver, but I'm not sure. This is quite confusing and I'm going to need to understand it more fully.

Additionally, this page says that for the Hopper architecture I can use CUDA 12.x and cudnn 8.9.3.

On top of all this there's something called CUDA compute capability, which apparently needs to be above 9. I have no idea what that is yet, just that it's not the cudnn version, because if it were, that would be yet more conflicting information. What a minefield. I'm really hoping someone will jump in here and help explain all of this.
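For what it's worth, here is how I would sketch the compute capability side of this in Python. As far as I know, compute capability is a property of the GPU itself (not of cudnn), the 4090 is an Ada Lovelace card at 8.9 rather than Hopper (9.0), and each capability has a first CUDA toolkit release that can target it. The table is typed in by hand from my reading of Nvidia's documentation, so treat every row as an assumption to re-check; the names are made up.

```python
# Compute capability -> (architecture, first CUDA toolkit that targets it).
# Hand-typed assumption; verify against Nvidia's CUDA documentation.
FIRST_CUDA_FOR_CAPABILITY = {
    (7, 5): ("Turing", (10, 0)),
    (8, 0): ("Ampere", (11, 0)),
    (8, 6): ("Ampere", (11, 1)),
    (8, 9): ("Ada Lovelace", (11, 8)),
    (9, 0): ("Hopper", (11, 8)),
}


def toolkit_targets_gpu(capability, toolkit):
    """Return True if the toolkit release is new enough to target this GPU.

    This ignores PTX just-in-time compilation, through which a binary
    built with an older toolkit can sometimes still run on a newer GPU.
    """
    _architecture, first_cuda = FIRST_CUDA_FOR_CAPABILITY[capability]
    return toolkit >= first_cuda


if __name__ == "__main__":
    rtx_4090 = (8, 9)
    for toolkit in [(10, 1), (11, 3), (11, 8), (12, 2)]:
        print(toolkit, toolkit_targets_gpu(rtx_4090, toolkit))
```

On these assumptions, cudatoolkit 10.1 and 11.3 both predate the 4090, while 11.8 and 12.x can target it directly.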

Jul 19 '23 08:07 marshalleq

A trial installation utilising advice from the issue tracker

The advice is to downgrade tensorflow-gpu to 2.3.1, and there are comments that for some it needs to go even lower to work.

I note the package cudnn main/linux-64::cudnn-7.6.5-cuda10.1_0. As far as I can tell from the tested build matrix, CUDA 10.1 with cudnn 7.6 is only 'officially' paired with tensorflow 2.1 to 2.3, not with the tensorflow-gpu 2.4.0 that requirements-cuda.txt pins, and there is no minor version compatibility for 10.1 builds. So this is unexpected. I'm guessing it's custom compiled.

Using the default installation method outlined above, no GPU is detected. CPU does work but is obviously very slow. If I follow the advice above and downgrade tensorflow from 2.4.0:

python -m pip install tensorflow-gpu==2.3.1

The following happens:

Process Process-1:
Traceback (most recent call last):
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/core/leras/device.py", line 123, in _get_tf_devices_proc
    physical_devices = device_lib.list_local_devices()
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid]

If I downgrade further:

python -m pip install tensorflow-gpu==2.2.1

Things appear to be better in that the GPU is now detected. The model pauses for a very long time (5 minutes plus) at initializing step 2/5 - 40%, with a single core maxed out at 100% for python3.7 and little or no GPU activity, but it eventually proceeds to loading samples and the training Starting window. I can only assume python is compiling some library or downloading something, because this behaviour disappears on subsequent attempts.

At the "Starting Press Enter to stop training" message there is again no notable GPU activity, with python3.7 driving a single CPU core to 100%. Python does have 2.5G of GPU memory allocated, but there is a setting that tells it to reserve that memory for this exercise, so this does not necessarily mean that the GPU is going to do anything.

I can also confirm the model summary window has identified the GPU correctly and is no longer running on CPU. Last time I left this for about 30 minutes and nothing progressed. I am seeing this behaviour again and am assuming that if I leave it long enough it will eventually proceed, as it did above. If it would only use multiple cores we might have a chance of it finishing this century.
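Allocated memory alone doesn't show whether the GPU is working, so one option is to poll utilisation and power draw. A minimal Python sketch, assuming nvidia-smi is on the PATH. The query fields utilization.gpu, power.draw and memory.used are standard nvidia-smi options; the function names are made up, and the parser assumes every field comes back as a number (a card that reports [N/A] would need extra handling).

```python
import subprocess

QUERY_FIELDS = ["utilization.gpu", "power.draw", "memory.used"]


def parse_gpu_line(line):
    """Parse one 'csv,noheader,nounits' line into a dict of floats."""
    values = [float(part.strip()) for part in line.split(",")]
    return dict(zip(QUERY_FIELDS, values))


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.run(
        command, capture_output=True, text=True, check=True
    ).stdout
    return [parse_gpu_line(line) for line in output.splitlines() if line.strip()]
```

Calling query_gpus() every few seconds while training sits at the prompt should show whether utilisation and power draw ever move off idle, independently of how much memory is reserved.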

OK, so it did eventually finish and, unlike in CPU mode, I now have corrupted previews in the training preview window. There is the following error message, which from reading it probably means the training is functioning correctly and only the preview window is not working. I can confirm that the GPU is active simply by observing the power draw in nvidia-smi. I can also see 100% usage on all 32 threads of my CPU. Not sure that's meant to be happening.

Traceback (most recent call last):
  File "./DeepFaceLab/main.py", line 348, in <module>
    arguments.func(arguments)
  File "./DeepFaceLab/main.py", line 137, in process_train
    Trainer.main(**kwargs)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/mainscripts/Trainer.py", line 317, in main
    lh_img = models.ModelBase.get_loss_history_preview(loss_history_to_show, iter, w, c)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/models/ModelBase.py", line 627, in get_loss_history_preview
    ph_max = int ( (plist_max[col][p] / plist_abs_max) * (lh_height-1) )
ValueError: cannot convert float NaN to integer

I did a quick search on the ValueError and it seems this could be fixed with some minor code changes, so that may be an avenue. However, this whole scenario is not at all desired behaviour, and after some time I also realised I am not able to interact with the Training Preview window: I can't press any key and can't save the model. Ideally, we'd be able to get this working with no code changes, or with minor ones only. Sadly I'm not a developer, so this will take time to figure out. Where to from here?
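One shape such a minor code fix could take, as a hedged Python sketch rather than DeepFaceLab's actual fix: wrap the scaling from the last traceback line in a helper that falls back to 0 when the result is not a finite number. The helper name scale_to_pixel_row is made up, and its arguments only mirror the names in the traceback. Note that this would only keep the preview from crashing. A NaN at this point suggests that the loss values themselves are NaN, or that the maximum is zero, and a preview guard does not repair that.

```python
import math


def scale_to_pixel_row(value, abs_max, height):
    """Map a loss value onto a pixel row in [0, height - 1].

    Mirrors int((value / abs_max) * (height - 1)) from the traceback,
    but returns 0 instead of raising when the result is not finite.
    """
    if abs_max == 0:
        return 0
    ratio = value / abs_max
    if not math.isfinite(ratio):
        return 0
    return int(ratio * (height - 1))
```

With this, the failing line would read ph_max = scale_to_pixel_row(plist_max[col][p], plist_abs_max, lh_height), assuming the neighbouring lines in ModelBase.py follow the same pattern.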

Possible valid configurations

So given that this isn't working, from the above version compatibility charts it would appear that I need to do one of the following:

Match the installed python 3.7 and use a tensorflow between 2.7.0 and 2.11.0, along with cudnn 8.1 and CUDA 11.2.
Use tensorflow 2.13.0, python 3.8, cudnn 8.6 and CUDA 11.8 (I see there is an easy way to set the python version with the included env.sh).
Understand if any of this is needed at all and if a simple code change would add support somehow.
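To keep these options straight I find it easier to treat the tested build table as data. A minimal Python sketch follows. The rows are hand-copied from the tensorflow tested build configurations page as I read it in mid-2023, so they are an assumption to re-check against the live table, only a few rows are included, and the helper name is made up.

```python
# (tensorflow, python versions, cudnn, cuda) for Linux GPU builds.
# Hand-copied assumption; releases 2.8 to 2.10 are omitted for brevity.
TESTED_BUILDS = [
    ("2.13", ("3.8", "3.9", "3.10", "3.11"), "8.6", "11.8"),
    ("2.12", ("3.8", "3.9", "3.10", "3.11"), "8.6", "11.8"),
    ("2.11", ("3.7", "3.8", "3.9", "3.10"), "8.1", "11.2"),
    ("2.7", ("3.7", "3.8", "3.9"), "8.1", "11.2"),
    ("2.4", ("3.6", "3.7", "3.8"), "8.0", "11.0"),
    ("2.3", ("3.5", "3.6", "3.7", "3.8"), "7.6", "10.1"),
]


def builds_for(python, cuda):
    """Return the tensorflow releases tested with this python and CUDA."""
    return [
        tensorflow
        for tensorflow, pythons, _cudnn, build_cuda in TESTED_BUILDS
        if python in pythons and build_cuda == cuda
    ]


if __name__ == "__main__":
    print(builds_for("3.7", "11.2"))  # option one
    print(builds_for("3.8", "11.8"))  # option two
    print(builds_for("3.7", "11.8"))  # python 3.7 with CUDA 11.8
```

On this reading, python 3.7 with CUDA 11.8 has no tested tensorflow build, which is why the second option also moves to python 3.8.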

After all this, I am still unsure if an Nvidia 4000 series will run on earlier versions of python, cuda libraries, cudnn and tensorflow or not. Again, I hope this information provides some clues to someone with more skills in this space than me and they can report their progress back here.

Jul 19 '23 08:07 marshalleq

Just noticing there is a method for the MVE fork in this thread - has better options and is newer.

conda create -n deepfacelab -c main python=3.7 cudnn=8.2.1 cudatoolkit=11.3.1

For reference the following packages are installed:

_libgcc_mutex      main/linux-64::_libgcc_mutex-0.1-main 
 _openmp_mutex      main/linux-64::_openmp_mutex-5.1-1_gnu 
 ca-certificates    main/linux-64::ca-certificates-2023.05.30-h06a4308_0 
 certifi            main/linux-64::certifi-2022.12.7-py37h06a4308_0 
 cudatoolkit        main/linux-64::cudatoolkit-11.3.1-h2bc3f7f_2 
 cudnn              main/linux-64::cudnn-8.2.1-cuda11.3_0 
 ld_impl_linux-64   main/linux-64::ld_impl_linux-64-2.38-h1181459_1 
 libffi             main/linux-64::libffi-3.4.4-h6a678d5_0 
 libgcc-ng          main/linux-64::libgcc-ng-11.2.0-h1234567_1 
 libgomp            main/linux-64::libgomp-11.2.0-h1234567_1 
 libstdcxx-ng       main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 
 ncurses            main/linux-64::ncurses-6.4-h6a678d5_0 
 openssl            main/linux-64::openssl-1.1.1u-h7f8727e_0 
 pip                main/linux-64::pip-22.3.1-py37h06a4308_0 
 python             main/linux-64::python-3.7.16-h7a1cb2a_0 
 readline           main/linux-64::readline-8.2-h5eee18b_0 
 setuptools         main/linux-64::setuptools-65.6.3-py37h06a4308_0 
 sqlite             main/linux-64::sqlite-3.41.2-h5eee18b_0 
 tk                 main/linux-64::tk-8.6.12-h1ccaba5_0 
 wheel              main/linux-64::wheel-0.38.4-py37h06a4308_0 
 xz                 main/linux-64::xz-5.4.2-h5eee18b_0 
 zlib               main/linux-64::zlib-1.2.13-h5eee18b_0

requirements-cuda.txt is different

tqdm
numpy==1.19.3
numexpr
h5py==3.1.0
opencv-python==4.1.0.25
ffmpeg-python==0.1.17
scikit-image==0.14.2
scipy==1.4.1
colorama
tensorflow
pyqt5
tf2onnx==1.9.3
Flask==1.1.1
flask-socketio==4.2.1
tensorboardX
crc32c
jsonschema

Interestingly, pip downloads multiple versions of tensorflow while resolving, maybe 14 so far, which is crazy (and multiple versions of a few other packages too). Maybe that will fix things up. Despite that, it only reports one version at the end:

Successfully installed Flask-1.1.1 Jinja2-3.1.2 MarkupSafe-2.1.3 PyQt5-Qt5-5.15.2 PyQt5-sip-12.12.1 PyWavelets-1.3.0 Werkzeug-2.2.3 absl-py-1.4.0 astunparse-1.6.3 attrs-23.1.0 bidict-0.22.1 cached-property-1.5.2 cachetools-5.3.1 charset-normalizer-3.2.0 click-8.1.6 cloudpickle-2.2.1 colorama-0.4.6 crc32c-2.3.post0 cycler-0.11.0 dask-2022.2.0 ffmpeg-python-0.1.17 flask-socketio-4.2.1 flatbuffers-1.12 fonttools-4.38.0 fsspec-2023.1.0 future-0.18.3 gast-0.4.0 google-auth-2.22.0 google-auth-oauthlib-0.4.6 google-pasta-0.2.0 grpcio-1.56.0 h5py-3.1.0 idna-3.4 importlib-metadata-6.7.0 importlib-resources-5.12.0 itsdangerous-2.1.2 jsonschema-4.17.3 keras-2.7.0 keras-preprocessing-1.1.2 kiwisolver-1.4.4 libclang-16.0.6 locket-1.0.0 markdown-3.4.3 matplotlib-3.5.3 networkx-2.6.3 numexpr-2.8.4 numpy-1.19.3 oauthlib-3.2.2 onnx-1.12.0 opencv-python-4.1.0.25 opt-einsum-3.3.0 packaging-23.1 partd-1.4.0 pillow-9.5.0 pkgutil-resolve-name-1.3.10 protobuf-3.19.6 pyasn1-0.5.0 pyasn1-modules-0.3.0 pyparsing-3.1.0 pyqt5-5.15.9 pyrsistent-0.19.3 python-dateutil-2.8.2 python-engineio-4.5.1 python-socketio-5.8.0 pyyaml-6.0.1 requests-2.31.0 requests-oauthlib-1.3.1 rsa-4.9 scikit-image-0.14.2 scipy-1.4.1 six-1.16.0 tensorboard-2.11.2 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorboardX-2.6 tensorflow-2.7.4 tensorflow-estimator-2.7.0 tensorflow-io-gcs-filesystem-0.32.0 termcolor-2.3.0 tf2onnx-1.9.3 toolz-0.12.0 tqdm-4.65.0 typing-extensions-4.7.1 urllib3-1.26.16 wrapt-1.15.0 zipp-3.15.0

But does it work?

It does, but only on CPU. Sigh. What is going on here? I can only assume that the CUDA libraries installed are not compatible with 4000 series cards. But I will try this MVE fork again another night.

Jul 19 '23 10:07 marshalleq

Reading online, apparently 11.8 is the minimum version for a 4090. So tensorflow must be 2.12.0 or 2.13.0 with cudnn 8.6, and that requires python 3.8. It is STILL unclear whether they mean the system CUDA driver or the installed cudatoolkit. They really need to distinguish those better.

conda create -n deepfacelab -c main python=3.8 cudnn=8.6 cudatoolkit=11.8

However, I clearly need a more specific version of cudnn, because 8.6 doesn't install. Anaconda's search page doesn't list any 8.6 version, but neither does it list the previous one that I successfully installed. Further, it doesn't even let me search versions of the packages it does display under a straight cudnn search. How do I find what versions of 8.6 it has?
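One way to answer that from the command line rather than the website is conda search with JSON output. Below is a minimal Python sketch, assuming conda is on the PATH and that conda search --json returns an object mapping each package name to a list of records with "version" and "build" keys (that is my understanding of its output, so treat it as an assumption). The function names are made up.

```python
import json
import subprocess


def matching_builds(search_result, package, version_prefix):
    """Return sorted (version, build) pairs whose version starts with the prefix."""
    records = search_result.get(package, [])
    return sorted(
        (record["version"], record["build"])
        for record in records
        if record["version"].startswith(version_prefix)
    )


def conda_search(package, channel):
    """Run 'conda search' for one package in one channel and parse the JSON."""
    command = ["conda", "search", package, "-c", channel, "--json"]
    output = subprocess.run(
        command, capture_output=True, text=True, check=True
    ).stdout
    return json.loads(output)
```

For example, matching_builds(conda_search("cudnn", "main"), "cudnn", "8.6") would list whatever 8.6 builds the main channel carries, and an empty list would explain why cudnn=8.6 does not install from it.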

Found a non-official version by just scrolling: conda install -c cudistas cudnn, and the same channel also has cudatoolkit 11.8. Not sure how to install these from one command, so:

conda create -n deepfacelab -c main python=3.8
conda install -c cudistas cudatoolkit
conda install -c cudistas cudnn

Both of the cudistas installs complain about package inconsistency, so not ideal. And the result is that it installed cudnn 8.9.2.26 and cuda 11.0. This really is a mystery.

But for fun I will try it anyway. This clearly will need to be updated, but let's see what it says:

python -m pip install -r ./DeepFaceLab/requirements-cuda.txt

That didn't work, so I just removed all the version pins and let it resolve itself. For reference it installed:

Requirement already satisfied: tqdm in /home/quentinj/.local/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 1)) (4.65.0)
Requirement already satisfied: numpy in /home/quentinj/.local/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 2)) (1.22.0)
Requirement already satisfied: numexpr in /home/quentinj/anaconda3/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 3)) (2.8.4)
Requirement already satisfied: h5py in /home/quentinj/.local/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 4)) (3.9.0)
Requirement already satisfied: opencv-python in /home/quentinj/.local/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 5)) (4.8.0.74)
Requirement already satisfied: ffmpeg-python in /home/quentinj/.local/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 6)) (0.2.0)
Requirement already satisfied: scikit-image in /home/quentinj/.local/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 7)) (0.21.0)
Requirement already satisfied: scipy in /home/quentinj/.local/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 8)) (1.11.1)
Requirement already satisfied: colorama in /home/quentinj/anaconda3/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 9)) (0.4.6)
Requirement already satisfied: tensorflow in /home/quentinj/anaconda3/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 10)) (2.8.0)
Requirement already satisfied: pyqt5 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 11)) (5.15.9)
Requirement already satisfied: tf2onnx in /home/quentinj/anaconda3/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 12)) (1.14.0)
Requirement already satisfied: Flask in /home/quentinj/.local/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 13)) (2.3.2)
Collecting flask-socketio (from -r ./DeepFaceLab/requirements-cuda.txt (line 14))
  Downloading Flask_SocketIO-5.3.4-py3-none-any.whl (17 kB)
Collecting tensorboardX (from -r ./DeepFaceLab/requirements-cuda.txt (line 15))
  Obtaining dependency information for tensorboardX from https://files.pythonhosted.org/packages/02/bd/673947dde6b3a43f4ffc3abaf103947c4fb574ac8b7c32747f2421f1f7c9/tensorboardX-2.6.1-py2.py3-none-any.whl.metadata
  Downloading tensorboardX-2.6.1-py2.py3-none-any.whl.metadata (5.6 kB)
Collecting crc32c (from -r ./DeepFaceLab/requirements-cuda.txt (line 16))
  Downloading crc32c-2.3.post0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (43 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.2/43.2 kB 4.7 MB/s eta 0:00:00
Requirement already satisfied: jsonschema in /home/quentinj/.local/lib/python3.10/site-packages (from -r ./DeepFaceLab/requirements-cuda.txt (line 17)) (4.17.3)
Requirement already satisfied: future in /home/quentinj/.local/lib/python3.10/site-packages (from ffmpeg-python->-r ./DeepFaceLab/requirements-cuda.txt (line 6)) (0.18.3)
Requirement already satisfied: networkx>=2.8 in /home/quentinj/.local/lib/python3.10/site-packages (from scikit-image->-r ./DeepFaceLab/requirements-cuda.txt (line 7)) (2.8.8)
Requirement already satisfied: pillow>=9.0.1 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from scikit-image->-r ./DeepFaceLab/requirements-cuda.txt (line 7)) (9.4.0)
Requirement already satisfied: imageio>=2.27 in /home/quentinj/.local/lib/python3.10/site-packages (from scikit-image->-r ./DeepFaceLab/requirements-cuda.txt (line 7)) (2.31.1)
Requirement already satisfied: tifffile>=2022.8.12 in /home/quentinj/.local/lib/python3.10/site-packages (from scikit-image->-r ./DeepFaceLab/requirements-cuda.txt (line 7)) (2023.4.12)
Requirement already satisfied: PyWavelets>=1.1.1 in /home/quentinj/.local/lib/python3.10/site-packages (from scikit-image->-r ./DeepFaceLab/requirements-cuda.txt (line 7)) (1.4.1)
Requirement already satisfied: packaging>=21 in /home/quentinj/.local/lib/python3.10/site-packages (from scikit-image->-r ./DeepFaceLab/requirements-cuda.txt (line 7)) (23.1)
Requirement already satisfied: lazy_loader>=0.2 in /home/quentinj/.local/lib/python3.10/site-packages (from scikit-image->-r ./DeepFaceLab/requirements-cuda.txt (line 7)) (0.2)
Requirement already satisfied: absl-py>=0.4.0 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (1.4.0)
Requirement already satisfied: astunparse>=1.6.0 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (1.6.3)
Requirement already satisfied: flatbuffers>=1.12 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (2.0.7)
Requirement already satisfied: gast>=0.2.1 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (0.4.0)
Requirement already satisfied: google-pasta>=0.1.1 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (0.2.0)
Requirement already satisfied: keras-preprocessing>=1.1.1 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (1.1.2)
Requirement already satisfied: libclang>=9.0.1 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (16.0.0)
Requirement already satisfied: opt-einsum>=2.3.2 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (3.3.0)
Requirement already satisfied: protobuf>=3.9.2 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (4.23.3)
Requirement already satisfied: setuptools in /home/quentinj/anaconda3/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (68.0.0)
Requirement already satisfied: six>=1.12.0 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (1.16.0)
Requirement already satisfied: termcolor>=1.1.0 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (2.3.0)
Requirement already satisfied: typing-extensions>=3.6.6 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (4.5.0)
Requirement already satisfied: wrapt>=1.11.0 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (1.12.1)
Requirement already satisfied: tensorboard<2.9,>=2.8 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (2.8.0)
Requirement already satisfied: tf-estimator-nightly==2.8.0.dev2021122109 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (2.8.0.dev2021122109)
Requirement already satisfied: keras<2.9,>=2.8.0rc0 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (2.8.0)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (0.32.0)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (1.56.0)
Requirement already satisfied: PyQt5-sip<13,>=12.11 in /home/quentinj/anaconda3/lib/python3.10/site-packages/PyQt5_sip-12.11.0-py3.10-linux-x86_64.egg (from pyqt5->-r ./DeepFaceLab/requirements-cuda.txt (line 11)) (12.11.0)
Requirement already satisfied: PyQt5-Qt5>=5.15.2 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from pyqt5->-r ./DeepFaceLab/requirements-cuda.txt (line 11)) (5.15.2)
Requirement already satisfied: onnx>=1.4.1 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from tf2onnx->-r ./DeepFaceLab/requirements-cuda.txt (line 12)) (1.14.0)
Requirement already satisfied: requests in /home/quentinj/.local/lib/python3.10/site-packages (from tf2onnx->-r ./DeepFaceLab/requirements-cuda.txt (line 12)) (2.31.0)
Requirement already satisfied: Werkzeug>=2.3.3 in /home/quentinj/.local/lib/python3.10/site-packages (from Flask->-r ./DeepFaceLab/requirements-cuda.txt (line 13)) (2.3.6)
Requirement already satisfied: Jinja2>=3.1.2 in /home/quentinj/.local/lib/python3.10/site-packages (from Flask->-r ./DeepFaceLab/requirements-cuda.txt (line 13)) (3.1.2)
Requirement already satisfied: itsdangerous>=2.1.2 in /home/quentinj/.local/lib/python3.10/site-packages (from Flask->-r ./DeepFaceLab/requirements-cuda.txt (line 13)) (2.1.2)
Requirement already satisfied: click>=8.1.3 in /home/quentinj/.local/lib/python3.10/site-packages (from Flask->-r ./DeepFaceLab/requirements-cuda.txt (line 13)) (8.1.3)
Requirement already satisfied: blinker>=1.6.2 in /home/quentinj/.local/lib/python3.10/site-packages (from Flask->-r ./DeepFaceLab/requirements-cuda.txt (line 13)) (1.6.2)
Collecting python-socketio>=5.0.2 (from flask-socketio->-r ./DeepFaceLab/requirements-cuda.txt (line 14))
  Using cached python_socketio-5.8.0-py3-none-any.whl (56 kB)
Requirement already satisfied: attrs>=17.4.0 in /home/quentinj/.local/lib/python3.10/site-packages (from jsonschema->-r ./DeepFaceLab/requirements-cuda.txt (line 17)) (23.1.0)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /home/quentinj/.local/lib/python3.10/site-packages (from jsonschema->-r ./DeepFaceLab/requirements-cuda.txt (line 17)) (0.19.3)
Requirement already satisfied: wheel<1.0,>=0.23.0 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from astunparse>=1.6.0->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (0.40.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/quentinj/.local/lib/python3.10/site-packages (from Jinja2>=3.1.2->Flask->-r ./DeepFaceLab/requirements-cuda.txt (line 13)) (2.1.3)
Collecting bidict>=0.21.0 (from python-socketio>=5.0.2->flask-socketio->-r ./DeepFaceLab/requirements-cuda.txt (line 14))
  Using cached bidict-0.22.1-py3-none-any.whl (35 kB)
Collecting python-engineio>=4.3.0 (from python-socketio>=5.0.2->flask-socketio->-r ./DeepFaceLab/requirements-cuda.txt (line 14))
  Obtaining dependency information for python-engineio>=4.3.0 from https://files.pythonhosted.org/packages/c1/b5/e555067d8dd44b5bccbd17f1ca4fdadd2e4defbb0022a296030d76293d25/python_engineio-4.5.1-py3-none-any.whl.metadata
  Downloading python_engineio-4.5.1-py3-none-any.whl.metadata (2.2 kB)
Requirement already satisfied: google-auth<3,>=1.6.3 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorboard<2.9,>=2.8->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (2.21.0)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from tensorboard<2.9,>=2.8->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (0.4.6)
Requirement already satisfied: markdown>=2.6.8 in /home/quentinj/.local/lib/python3.10/site-packages (from tensorboard<2.9,>=2.8->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (3.4.3)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from tensorboard<2.9,>=2.8->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (0.6.1)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from tensorboard<2.9,>=2.8->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (1.8.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/quentinj/.local/lib/python3.10/site-packages (from requests->tf2onnx->-r ./DeepFaceLab/requirements-cuda.txt (line 12)) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /home/quentinj/.local/lib/python3.10/site-packages (from requests->tf2onnx->-r ./DeepFaceLab/requirements-cuda.txt (line 12)) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /home/quentinj/.local/lib/python3.10/site-packages (from requests->tf2onnx->-r ./DeepFaceLab/requirements-cuda.txt (line 12)) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in /home/quentinj/.local/lib/python3.10/site-packages (from requests->tf2onnx->-r ./DeepFaceLab/requirements-cuda.txt (line 12)) (2023.5.7)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /home/quentinj/.local/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.9,>=2.8->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (5.3.1)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /home/quentinj/.local/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.9,>=2.8->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (0.3.0)
Requirement already satisfied: rsa<5,>=3.1.4 in /home/quentinj/.local/lib/python3.10/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.9,>=2.8->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (4.9)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /home/quentinj/.local/lib/python3.10/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.9,>=2.8->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (1.3.1)
Requirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /home/quentinj/.local/lib/python3.10/site-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.9,>=2.8->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (0.5.0)
Requirement already satisfied: oauthlib>=3.0.0 in /home/quentinj/anaconda3/lib/python3.10/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.9,>=2.8->tensorflow->-r ./DeepFaceLab/requirements-cuda.txt (line 10)) (3.2.2)
Using cached tensorboardX-2.6.1-py2.py3-none-any.whl (101 kB)
Using cached python_engineio-4.5.1-py3-none-any.whl (53 kB)
DEPRECATION: torchsde 0.2.5 has a non-standard dependency specifier numpy>=1.19.*; python_version >= "3.7". pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of torchsde or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: crc32c, tensorboardX, python-engineio, bidict, python-socketio, flask-socketio
Successfully installed bidict-0.22.1 crc32c-2.3.post0 flask-socketio-5.3.4 python-engineio-4.5.1 python-socketio-5.8.0 tensorboardX-2.6.1

Edit scripts/env.sh to reflect Python 3.8, except that despite all this consistency, Python 3.10.9 has been installed! OK, let's change it to that.

bash 6_train_SAEHD.sh

Of course this doesn't work.

Process Process-1:
Traceback (most recent call last):
  File "/home/quentinj/anaconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/quentinj/anaconda3/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/core/leras/device.py", line 102, in _get_tf_devices_proc
    import tensorflow
  File "/home/quentinj/anaconda3/lib/python3.10/site-packages/tensorflow/__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/home/quentinj/anaconda3/lib/python3.10/site-packages/tensorflow/python/__init__.py", line 37, in <module>
    from tensorflow.python.eager import context
  File "/home/quentinj/anaconda3/lib/python3.10/site-packages/tensorflow/python/eager/context.py", line 29, in <module>
    from tensorflow.core.framework import function_pb2
  File "/home/quentinj/anaconda3/lib/python3.10/site-packages/tensorflow/core/framework/function_pb2.py", line 16, in <module>
    from tensorflow.core.framework import attr_value_pb2 as tensorflow_dot_core_dot_framework_dot_attr__value__pb2
  File "/home/quentinj/anaconda3/lib/python3.10/site-packages/tensorflow/core/framework/attr_value_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__pb2
  File "/home/quentinj/anaconda3/lib/python3.10/site-packages/tensorflow/core/framework/tensor_pb2.py", line 16, in <module>
    from tensorflow.core.framework import resource_handle_pb2 as tensorflow_dot_core_dot_framework_dot_resource__handle__pb2
  File "/home/quentinj/anaconda3/lib/python3.10/site-packages/tensorflow/core/framework/resource_handle_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_shape_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__shape__pb2
  File "/home/quentinj/anaconda3/lib/python3.10/site-packages/tensorflow/core/framework/tensor_shape_pb2.py", line 36, in <module>
    _descriptor.FieldDescriptor(
  File "/home/quentinj/.local/lib/python3.10/site-packages/google/protobuf/descriptor.py", line 561, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

Will have to try again later.
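
For anyone hitting the same protobuf error: the traceback above loads tensorflow from anaconda3 but google/protobuf from ~/.local, so a user-level protobuf may be shadowing the one the environment expects. A minimal sketch for checking which protobuf the current interpreter picks up, assuming only the 3.20.x threshold quoted in the error message (the function names here are made up for illustration):

```python
from importlib import metadata


def protobuf_too_new(version: str) -> bool:
    """Return True if this protobuf version is newer than the 3.20.x line."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) > (3, 20)


def report() -> str:
    """Describe the protobuf that the current interpreter would import."""
    try:
        version = metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        return "protobuf is not installed for this interpreter"
    # locate_file("") resolves to the site-packages directory of the distribution.
    location = metadata.distribution("protobuf").locate_file("")
    verdict = "too new for old _pb2.py files" if protobuf_too_new(version) else "ok"
    return f"protobuf {version} at {location}: {verdict}"


if __name__ == "__main__":
    print(report())
```

Run it with the same python3 that launches DeepFaceLab. If it reports a 4.x version under ~/.local, the first workaround from the error message (downgrading protobuf to 3.20.x) is probably the one to try.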

Jul 19 '23 20:07 marshalleq

According to this thread it works on 4090 under Suse linux.

So:

conda create -n deepfacelab -c main -c "nvidia/label/cuda-11.8.0" -c conda-forge python=3.7 cudnn=8 cuda-toolkit=11.8

which works.

It does install tensorflow 2.11, which is good. I don't know what the zabique requirements are (that must be a typo), so I'm installing the normal requirements-cuda.txt. Note that the requirements have tensorflow-gpu in them, which is no longer required from the tensorflow 2 series onwards. Add flatbuffers>=2.0 to requirements.txt and remove tensorflow.

This all compiles, but again, no GPU. This is madness.

Jul 20 '23 05:07 marshalleq

I'm not sure if it will work for you (I'm running a 3060), but YES, THIS IS A HEADACHE.

I can't tell who to blame really either :) For me, I run a different process to make this work. Maybe it will activate yours too.

# Create Conda env with Python 3.9
conda create -n deepfacelab -c main -c "nvidia/label/cuda-11.8.0" -c conda-forge python=3.9 cudnn=8 cuda-toolkit=11.8

# Install tensorflow as tensorflow says to do it - https://www.tensorflow.org/install/pip#linux

conda install -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# Verify install:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

This should return a detected GPU. If not, something else is missing.
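
If the one-liner above prints an empty list, it can help to see which CUDA and cuDNN versions the installed TensorFlow wheel was built against. A rough sketch of that check, assuming a TensorFlow 2.x build that provides `tf.sysconfig.get_build_info()` (the function name `tensorflow_gpu_report` is just for illustration):

```python
def tensorflow_gpu_report() -> dict:
    """Collect TensorFlow's version, build-time CUDA/cuDNN versions and visible GPUs."""
    try:
        import tensorflow as tf
    except Exception as exc:
        # Importing can fail with more than ImportError (e.g. a protobuf TypeError).
        return {"tensorflow_installed": False, "error": str(exc)}

    build = tf.sysconfig.get_build_info()
    return {
        "tensorflow_installed": True,
        "tensorflow_version": tf.__version__,
        # These keys are only present on CUDA builds, hence .get().
        "built_with_cuda": build.get("cuda_version"),
        "built_with_cudnn": build.get("cudnn_version"),
        "gpus": [gpu.name for gpu in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    for key, value in tensorflow_gpu_report().items():
        print(f"{key}: {value}")
```

Comparing the reported build versions with the cudatoolkit and cudnn packages in the Conda environment is one way to narrow down a mismatch.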

Update the requirements.txt file:

colorama
ffmpeg-python==0.2.0
h5py==3.8.0
numexpr
numpy==1.22.3
opencv-python==4.7.0.72
pyqt5
scikit-image==0.21.0
scipy==1.10.1
tf2onnx==1.14.0
tqdm

Install the new requirements, and I'm able to start working:

pip install -r requirements-cuda.txt

Jul 20 '23 15:07 gteachey

I just recalled, at some point, I did this as well. Seems a little drastic but maybe it will help if the steps I gave don't work

https://discuss.tensorflow.org/t/gpu-with-cuda-11-8-not-detected-could-not-find-cuda-drivers/16085/3

Jul 20 '23 15:07 gteachey

Thank you very much. I will try. I did come across one article from someone with a 3060 that looks a bit similar to this, which I tried, but before I started documenting it all here I had already tried quite a lot that wasn't written down. It seems I need to write this down, because it's far too confusing without knowing what has been tried. Will report back, thanks!

Jul 21 '23 00:07 marshalleq

I had to change requirements-cuda.txt to numpy==1.22.0 instead of 1.22.3, and also downgrade to tiktoken-0.3.1, to get it all to install; otherwise installation is perfect. The GPU is detected at the step you mention (something I haven't actually been doing, now I think about it; I've only been sharing whether DeepFaceLab detects the GPU).

I got an error of course, why wouldn't I? LOL

Traceback (most recent call last):
  File "/home/quentinj/DeepFaceLab_Linux/./DeepFaceLab/main.py", line 9, in <module>
    from core.leras import nn
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/core/leras/__init__.py", line 1, in <module>
    from .nn import nn
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/core/leras/nn.py", line 26, in <module>
    from core.interact import interact as io
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/core/interact/__init__.py", line 1, in <module>
    from .interact import interact
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/core/interact/interact.py", line 8, in <module>
    import colorama
ModuleNotFoundError: No module named 'colorama'

OK, so this is something I learnt: pip install colorama is not the same as python3 -m pip install colorama. The latter solves the above, the former does not.

Also had to do this for cv2 (opencv-python package), tqdm, PIL (Pillow package), numexpr, yaml (pyyaml package) and jsonschema. That's a lot of missing dependencies.
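
A likely explanation (an assumption, not something confirmed in this thread) is that the bare `pip` command belongs to a different Python than the `python3` that runs DeepFaceLab, for example the base Anaconda install rather than the deepfacelab environment. A small sketch to check this by reading the `#!` line of the `pip` script; the helper names are made up for illustration:

```python
import shutil
import sys
from pathlib import Path
from typing import Optional


def shebang_interpreter(script: Path) -> Optional[str]:
    """Return the interpreter named on a script's #! line, or None if absent."""
    with open(script, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    return first_line[2:].strip()


def compare_pip_to_python() -> str:
    """Report which Python the bare `pip` command is tied to versus this one."""
    pip_path = shutil.which("pip")
    if pip_path is None:
        return f"no `pip` on PATH; this interpreter is {sys.executable}"
    return (
        f"`pip` is {pip_path}, run by {shebang_interpreter(Path(pip_path))}\n"
        f"`python3 -m pip` would use {sys.executable}"
    )


if __name__ == "__main__":
    print(compare_pip_to_python())
```

Run it with the same python3 used to start DeepFaceLab. If the two interpreters differ, packages installed with plain `pip` land somewhere DeepFaceLab never looks.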

This got me to the point where the GPU was detected and it looks like it's working, but then, no.

Now to decipher all this (I wouldn't blame me for giving up at this point, but I'm feeling persistent, despite having a perfectly functioning copy of this on Windows on the same machine).

[CPU] : CPU
  [0] : NVIDIA GeForce RTX 4090

[0] Which GPU indexes to choose? : 
0

Initializing models: 100%|####################################################################################################| 5/5 [00:02<00:00,  1.82it/s]
Loading samples: 100%|#################################################################################################| 1515/1515 [00:04<00:00, 377.60it/s]
Process Process-13:
Process Process-12:
Traceback (most recent call last):
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleGeneratorFace.py", line 140, in batch_func
    output_samples, random_flip = SampleProcessor.process ([sample], self.sample_process_options, self.output_sample_types, self.debug, ct_sample=ct_sample)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 148, in process
    img = get_full_face_mask()
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 71, in get_full_face_mask
    full_face_mask = LandmarksProcessor.get_image_hull_mask (sample_bgr.shape, sample_landmarks, eyebrows_expand_mod=sample.eyebrows_expand_mod )
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/facelib/LandmarksProcessor.py", line 391, in get_image_hull_mask
    lmrks = expand_eyebrows(image_landmarks, eyebrows_expand_mod)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/facelib/LandmarksProcessor.py", line 364, in expand_eyebrows
    lmrks = np.array( lmrks.copy(), dtype=np.int )
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/core/joblib/SubprocessGenerator.py", line 54, in process_func
    gen_data = next (self.generator_func)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleGeneratorFace.py", line 142, in batch_func
    raise Exception ("Exception occured in sample %s. Error: %s" % (sample.filename, traceback.format_exc() ) )
Exception: Exception occured in sample /home/quentinj/DeepFaceLab_Linux/workspace/data_src/aligned/01383.jpg. Error: Traceback (most recent call last):
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleGeneratorFace.py", line 140, in batch_func
    output_samples, random_flip = SampleProcessor.process ([sample], self.sample_process_options, self.output_sample_types, self.debug, ct_sample=ct_sample)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 148, in process
    img = get_full_face_mask()
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 71, in get_full_face_mask
    full_face_mask = LandmarksProcessor.get_image_hull_mask (sample_bgr.shape, sample_landmarks, eyebrows_expand_mod=sample.eyebrows_expand_mod )
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/facelib/LandmarksProcessor.py", line 391, in get_image_hull_mask
    lmrks = expand_eyebrows(image_landmarks, eyebrows_expand_mod)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/facelib/LandmarksProcessor.py", line 364, in expand_eyebrows
    lmrks = np.array( lmrks.copy(), dtype=np.int )
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Traceback (most recent call last):
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleGeneratorFace.py", line 140, in batch_func
    output_samples, random_flip = SampleProcessor.process ([sample], self.sample_process_options, self.output_sample_types, self.debug, ct_sample=ct_sample)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 148, in process
    img = get_full_face_mask()
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 71, in get_full_face_mask
    full_face_mask = LandmarksProcessor.get_image_hull_mask (sample_bgr.shape, sample_landmarks, eyebrows_expand_mod=sample.eyebrows_expand_mod )
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/facelib/LandmarksProcessor.py", line 391, in get_image_hull_mask
    lmrks = expand_eyebrows(image_landmarks, eyebrows_expand_mod)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/facelib/LandmarksProcessor.py", line 364, in expand_eyebrows
    lmrks = np.array( lmrks.copy(), dtype=np.int )
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/core/joblib/SubprocessGenerator.py", line 54, in process_func
    gen_data = next (self.generator_func)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleGeneratorFace.py", line 142, in batch_func
    raise Exception ("Exception occured in sample %s. Error: %s" % (sample.filename, traceback.format_exc() ) )
Exception: Exception occured in sample /home/quentinj/DeepFaceLab_Linux/workspace/data_src/aligned/00658.jpg. Error: Traceback (most recent call last):
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleGeneratorFace.py", line 140, in batch_func
    output_samples, random_flip = SampleProcessor.process ([sample], self.sample_process_options, self.output_sample_types, self.debug, ct_sample=ct_sample)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 148, in process
    img = get_full_face_mask()
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 71, in get_full_face_mask
    full_face_mask = LandmarksProcessor.get_image_hull_mask (sample_bgr.shape, sample_landmarks, eyebrows_expand_mod=sample.eyebrows_expand_mod )
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/facelib/LandmarksProcessor.py", line 391, in get_image_hull_mask
    lmrks = expand_eyebrows(image_landmarks, eyebrows_expand_mod)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/facelib/LandmarksProcessor.py", line 364, in expand_eyebrows
    lmrks = np.array( lmrks.copy(), dtype=np.int )
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Process Process-19:
Traceback (most recent call last):
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleGeneratorFace.py", line 140, in batch_func
    output_samples, random_flip = SampleProcessor.process ([sample], self.sample_process_options, self.output_sample_types, self.debug, ct_sample=ct_sample)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 156, in process
    eye_mask = get_eyes_mask() * mask
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 75, in get_eyes_mask
    eyes_mask = LandmarksProcessor.get_image_eye_mask (sample_bgr.shape, sample_landmarks)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/facelib/LandmarksProcessor.py", line 417, in get_image_eye_mask
    image_landmarks = image_landmarks.astype(np.int)
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/core/joblib/SubprocessGenerator.py", line 54, in process_func
    gen_data = next (self.generator_func)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleGeneratorFace.py", line 142, in batch_func
    raise Exception ("Exception occured in sample %s. Error: %s" % (sample.filename, traceback.format_exc() ) )
Exception: Exception occured in sample 07453_0.jpg. Error: Traceback (most recent call last):
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleGeneratorFace.py", line 140, in batch_func
    output_samples, random_flip = SampleProcessor.process ([sample], self.sample_process_options, self.output_sample_types, self.debug, ct_sample=ct_sample)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 156, in process
    eye_mask = get_eyes_mask() * mask
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/samplelib/SampleProcessor.py", line 75, in get_eyes_mask
    eyes_mask = LandmarksProcessor.get_image_eye_mask (sample_bgr.shape, sample_landmarks)
  File "/home/quentinj/DeepFaceLab_Linux/DeepFaceLab/facelib/LandmarksProcessor.py", line 417, in get_image_eye_mask
    image_landmarks = image_landmarks.astype(np.int)
  File "/home/quentinj/anaconda3/envs/deepfacelab/lib/python3.9/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Realising most of the above was just repeating, I have deleted the middle of this trace.

I tried downgrading numpy to 1.19.0, but then I get the error below:

ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects

Which I understand is usually due to numpy not being compatible with the version of python installed.

So I think what this is all saying is that the DFL code needs to update its references to np.int and replace them with the builtin int. I'm guessing the only other way around this is to downgrade to something that doesn't care, which to date I have not been able to do, I think due to the 'modernness' of my OS or GPU.
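
For reference, the `np.int` alias was removed in NumPy 1.24, so any numpy below that still accepts it. The other route is the one the error message suggests: replace `np.int` with the builtin `int`. A minimal sketch of that rewrite on source text, not an official DFL patch (the function name `replace_np_int` is hypothetical):

```python
import re

# Matches the removed alias `np.int` but not `np.int32`, `np.int64` or `np.int_`.
NP_INT_ALIAS = re.compile(r"\bnp\.int\b")


def replace_np_int(source: str) -> str:
    """Rewrite `np.int` to the builtin `int`, as the NumPy error message suggests."""
    return NP_INT_ALIAS.sub("int", source)


if __name__ == "__main__":
    # The two lines reported in the tracebacks above (LandmarksProcessor.py).
    print(replace_np_int("lmrks = np.array( lmrks.copy(), dtype=np.int )"))
    print(replace_np_int("image_landmarks = image_landmarks.astype(np.int)"))
```

Applying it to a whole file would mean reading the file, passing the text through `replace_np_int` and writing it back. Keep a backup first, since other files in DeepFaceLab may use the alias as well.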

Jul 21 '23 00:07 marshalleq

python3 -m pip install --upgrade pip
python3 -m pip install tensorflow==2.13
python3 -m pip install numpy==1.22

And it works!!!!

OK, I will try to replicate this all from scratch. Thanks very much to @gteachey for the hints that enabled this! I just hope I can do it twice! LOL.

It turns out the ffmpeg package is missing too, so: python3 -m pip install ffmpeg and python3 -m pip install ffmpeg-python (possibly only the latter is needed).

Jul 21 '23 01:07 marshalleq

You're welcome! Glad to hear it's running now for you. Good to know for when I finally can move to a 40-series card my steps should work for them haha.

Jul 21 '23 21:07 gteachey

Good to know. I would like to try Python 3.8 or Python 3.9 with CUDA 11.8 later.

Reading the docs, the Python/CUDA versions are obviously out of date compared with the mainstream.

Aug 02 '23 07:08 flydragon2018

@gteachey this has been working well until recently. I've tracked down an error that happens specifically when we use one part of your instruction: echo 'export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

There is a thread about it here:

Since you wrote it I thought perhaps you might have more of a clue, I have not been able to figure it out yet.

Thanks for any help.
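
The thread doesn't say what the error was, so this is only a guess at where to look. The quoted line appends `:$LD_LIBRARY_PATH`, which leaves a trailing empty entry when the variable was previously unset, and each time env_vars.sh is sourced the same directories get prepended again. A small sketch that flags those cases (the function name `audit_library_path` is made up for illustration):

```python
import os
from typing import List, Tuple


def audit_library_path(value: str) -> List[Tuple[str, str]]:
    """Return (entry, problem) pairs for suspicious LD_LIBRARY_PATH entries."""
    if not value:
        return []
    problems = []
    seen = set()
    for entry in value.split(":"):
        if entry == "":
            # The dynamic loader treats an empty entry as the current directory.
            problems.append((entry, "empty entry"))
        elif entry in seen:
            problems.append((entry, "duplicate"))
        elif not os.path.isdir(entry):
            problems.append((entry, "not a directory"))
        seen.add(entry)
    return problems


if __name__ == "__main__":
    for entry, problem in audit_library_path(os.environ.get("LD_LIBRARY_PATH", "")):
        print(f"{problem}: {entry!r}")
```

Run it inside the activated deepfacelab environment. An empty report doesn't rule out LD_LIBRARY_PATH as the cause; it only rules out these three simple cases.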

Aug 25 '23 20:08 marshalleq

thx~

Sep 05 '23 13:09 tailangjun

Just want to echo that it is unfortunate the Windows version works flawlessly and the Linux version is a hot mess. Has anyone got this working reliably on an RTX 4090 using Linux?

Nov 21 '23 16:11 antomicblitz

Ah I actually got it to work! My god! I am running an Nvidia RTX 4090 card on Nobara/Fedora 38. I will share what I did as soon as possible.

Nov 21 '23 17:11 antomicblitz

I have the following setup:

Nobara/Fedora 38
Nvidia RTX 4090

Here is the hardware setup I have:

(base) [antonio@nobara-pc ~]$ nvidia-smi
Wed Nov 22 18:50:12 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0 Off |                  Off |
| 30%   40C    P2             103W / 450W |   9476MiB / 24564MiB |     30%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

It was a bit difficult to remember exactly what I did to set this up, but I edited my requirements-cuda.txt dependencies to have the following:

colorama
ffmpeg-python==0.2.0
h5py==3.8.0
numexpr
numpy==1.20.0
opencv-python==4.7.0.72
pyqt5
scikit-image==0.19.3
scipy==1.7.3
tf2onnx==1.14.0
tqdm

Instead of trying to guess what else I installed, I simply exported the list of all installed packages in my deepfacelab Conda environment. Anyone should be able to install it using

conda env create -f environment.yml

here is the environment.yml file:

name: deepfacelab
channels:
  - nvidia/label/cuda-11.8.0
  - main
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - ca-certificates=2023.11.17=hbcca054_0
  - certifi=2023.11.17=pyhd8ed1ab_0
  - cuda-cccl=11.8.89=0
  - cuda-command-line-tools=11.8.0=0
  - cuda-compiler=11.8.0=0
  - cuda-cudart=11.8.89=0
  - cuda-cudart-dev=11.8.89=0
  - cuda-cuobjdump=11.8.86=0
  - cuda-cupti=11.8.87=0
  - cuda-cuxxfilt=11.8.86=0
  - cuda-documentation=11.8.86=0
  - cuda-driver-dev=11.8.89=0
  - cuda-gdb=11.8.86=0
  - cuda-libraries=11.8.0=0
  - cuda-libraries-dev=11.8.0=0
  - cuda-memcheck=11.8.86=0
  - cuda-nsight=11.8.86=0
  - cuda-nsight-compute=11.8.0=0
  - cuda-nvcc=11.8.89=0
  - cuda-nvdisasm=11.8.86=0
  - cuda-nvml-dev=11.8.86=0
  - cuda-nvprof=11.8.87=0
  - cuda-nvprune=11.8.86=0
  - cuda-nvrtc=11.8.89=0
  - cuda-nvrtc-dev=11.8.89=0
  - cuda-nvtx=11.8.86=0
  - cuda-nvvp=11.8.87=0
  - cuda-profiler-api=11.8.86=0
  - cuda-sanitizer-api=11.8.86=0
  - cuda-toolkit=11.8.0=0
  - cuda-tools=11.8.0=0
  - cuda-visual-tools=11.8.0=0
  - cudatoolkit=11.8.0=h6a678d5_0
  - cudnn=8.9.2.26=cuda11_0
  - gds-tools=1.4.0.31=0
  - ld_impl_linux-64=2.38=h1181459_1
  - libcublas=11.11.3.6=0
  - libcublas-dev=11.11.3.6=0
  - libcufft=10.9.0.58=0
  - libcufft-dev=10.9.0.58=0
  - libcufile=1.4.0.31=0
  - libcufile-dev=1.4.0.31=0
  - libcurand=10.3.0.86=0
  - libcurand-dev=10.3.0.86=0
  - libcusolver=11.4.1.48=0
  - libcusolver-dev=11.4.1.48=0
  - libcusparse=11.7.5.86=0
  - libcusparse-dev=11.7.5.86=0
  - libffi=3.4.4=h6a678d5_0
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libnpp=11.8.0.86=0
  - libnpp-dev=11.8.0.86=0
  - libnvjpeg=11.9.0.86=0
  - libnvjpeg-dev=11.9.0.86=0
  - libstdcxx-ng=11.2.0=h1234567_1
  - ncurses=6.4=h6a678d5_0
  - nsight-compute=2022.3.0.22=0
  - openssl=1.1.1w=h7f8727e_0
  - pip=22.3.1=py37h06a4308_0
  - python=3.7.16=h7a1cb2a_0
  - readline=8.2=h5eee18b_0
  - setuptools=65.6.3=py37h06a4308_0
  - sqlite=3.41.2=h5eee18b_0
  - tk=8.6.12=h1ccaba5_0
  - wheel=0.38.4=py37h06a4308_0
  - xz=5.4.2=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
      - absl-py==0.15.0
      - astunparse==1.6.3
      - cached-property==1.5.2
      - cachetools==5.3.2
      - charset-normalizer==3.3.2
      - clang==5.0
      - colorama==0.4.6
      - ffmpeg-python==0.2.0
      - flatbuffers==1.12
      - future==0.18.3
      - gast==0.4.0
      - google-auth==2.23.4
      - google-auth-oauthlib==0.4.6
      - google-pasta==0.2.0
      - grpcio==1.59.3
      - h5py==3.8.0
      - idna==3.4
      - imageio==2.31.2
      - importlib-metadata==6.7.0
      - keras==2.11.0
      - keras-preprocessing==1.1.2
      - markdown==3.4.4
      - markupsafe==2.1.3
      - networkx==2.6.3
      - numexpr==2.8.6
      - numpy==1.20.0
      - nvidia-cublas-cu11==2022.4.8
      - nvidia-cublas-cu117==11.10.1.25
      - nvidia-cudnn-cu11==8.6.0.163
      - oauthlib==3.2.2
      - onnx==1.14.1
      - opencv-python==4.7.0.72
      - opt-einsum==3.3.0
      - packaging==23.2
      - pillow==9.5.0
      - protobuf==3.20.3
      - pyasn1==0.5.1
      - pyasn1-modules==0.3.0
      - pyqt5==5.15.10
      - pyqt5-qt5==5.15.2
      - pyqt5-sip==12.13.0
      - python-version==0.0.2
      - pywavelets==1.3.0
      - requests==2.31.0
      - requests-oauthlib==1.3.1
      - rsa==4.9
      - scikit-image==0.19.3
      - scipy==1.7.3
      - six==1.15.0
      - tensorboard==2.11.2
      - tensorboard-data-server==0.6.1
      - tensorboard-plugin-wit==1.8.1
      - tensorflow==2.6.0
      - tensorflow-estimator==2.15.0
      - termcolor==1.1.0
      - tf2onnx==1.14.0
      - tifffile==2021.11.2
      - tqdm==4.66.1
      - typing-extensions==3.7.4.3
      - urllib3==2.0.7
      - werkzeug==2.2.3
      - wrapt==1.12.1
      - zipp==3.15.0
prefix: /home/antonio/.conda/envs/deepfacelab
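
For anyone trying to replicate this, a rough sketch for checking a recreated environment against the pip pins in the export above. It assumes the file is saved as environment.yml and only looks at `name==version` lines, so the Conda `name=version=build` entries are skipped; the function names are made up for illustration:

```python
import re
from importlib import metadata
from typing import Dict, List

# pip pins look like "- numpy==1.20.0"; Conda pins use single "=" and do not match.
PIP_PIN = re.compile(r"^\s*-\s*([A-Za-z0-9_.\-]+)==(\S+)\s*$")


def parse_pip_pins(yaml_text: str) -> Dict[str, str]:
    """Collect `name==version` entries (the pip section) from an environment.yml."""
    pins = {}
    for line in yaml_text.splitlines():
        match = PIP_PIN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def mismatches(pins: Dict[str, str]) -> List[str]:
    """List pins that are missing or at a different version in this interpreter."""
    report = []
    for name, wanted in sorted(pins.items()):
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report.append(f"{name}: wanted {wanted}, not installed")
            continue
        if found != wanted:
            report.append(f"{name}: wanted {wanted}, found {found}")
    return report


if __name__ == "__main__":
    with open("environment.yml", encoding="utf-8") as handle:
        print("\n".join(mismatches(parse_pip_pins(handle.read()))) or "all pins match")
```

Note that the export pins tensorflow==2.6.0 alongside keras==2.11.0 and tensorflow-estimator==2.15.0, so it may be worth confirming which TensorFlow actually gets imported once the environment is built.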

Please let me know if anyone can replicate the results. Anyway, it looks like the main author of DeepFaceLab pretty much only updates things for Windows, and the Linux version is in a state of total disarray.

Nov 22 '23 05:11 antomicblitz
