automatic-sem-image-segmentation W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "Conv2D"

Hi, I am trying to reproduce your experiment. I am using the images and masks in your Archive/Automatic_SEM_Image_Segmentation directory directly. The environment and dependencies are exactly those listed in Releases/Version 1.1.1/requirements.txt. I use an NVIDIA GeForce RTX 4090. However, I hit the error in the title in the second step, "Simulating fake masks", when running python3 StartProcess.py. I tried using tensorflow 2.8 as suggested here, but the issue still exists. Do you have any ideas on how to fix it? Many thanks!

Jul 31 '24 03:07 longkunxuluke

Hi, I am sorry you're having trouble running the code. This error did not occur for us when we tested the code on our hardware, so I assume it is somewhat specific to your hardware and/or specific versions of your drivers and OS. The discussion in the link you posted seems to be about a slightly different error (op_level_cost_estimator.cc:689, whereas the heading of your post says op_level_cost_estimator.cc:690). Here are some suggestions:

Can you please post the entire error message and stack trace, along with some information which OS and which NVIDIA driver version you are using?
Technically, it is only a "warning", not an error (warnings start with "W tensorflow[...]", errors with "E tensorflow[...]"). Can you just ignore it? Or wrap it in a try-except statement if it actually throws an error?
Googling specifically for the 690 error also gives some results. Here, someone says that upgrading to tensorflow 2.15.1 solved the issue for them (I am not sure if my code will work with tensorflow 2.15, though, tensorflow tends to introduce breaking changes in their updates)
Someone else said that the error only occurred for them when running the code locally. Could you try running it e.g. on Google Colab and see if the error persists?
Here, someone else said it might be related to CPU memory, not GPU memory. It might also be related to a Memory Leak Issue with tensorflow.
Can you try running it on a different PC (perhaps with a different OS, different NVIDIA driver version, etc)?
Can you maybe try a different combination of tensorflow, CUDA, and cuDNN? (Unfortunately, not all combinations are compatible. I usually refer to this list, but you can experiment with different combinations and see what works.)
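To make the requested environment details easy to post, the following is a minimal sketch that collects the OS, Python version, NVIDIA driver version, and installed package versions in one place. The function names are made up for this illustration, and the package list (tensorflow, keras, numpy) is an assumption about what is relevant here. It only reads information and falls back to None when something is missing.

```python
import platform
import shutil
import subprocess
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def nvidia_driver_version():
    """Ask nvidia-smi for the driver version; return None if it cannot be determined."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def environment_report(packages=("tensorflow", "keras", "numpy")):
    """Collect OS, Python, driver, and package versions into a dict."""
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "nvidia_driver": nvidia_driver_version(),
    }
    for name in packages:
        report[name] = package_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If tensorflow is installed, `tf.sysconfig.get_build_info()` should additionally report the CUDA and cuDNN versions the installed build was compiled against, which helps when comparing against the compatibility list.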

Jul 31 '24 16:07 bruehle

Sorry, didn't mean to close the issue...

Jul 31 '24 16:07 bruehle

Hi, thanks for the detailed response.

Attached is the full log file. It seems step0 and step1 run normally, but the error appears at step2. My machine runs Ubuntu 22.04.4 LTS, and the NVIDIA driver is 535.183.01.
Yes, I understand it's just a warning, but every time, step2 of the code stops after the warning even though it has not finished, which causes the following steps to throw errors, as you can see in the attached log file.
Thanks for the references, I will give them a try and report back on whether tf 2.15.1, Google Colab, or a different OS/NVIDIA driver works.
I reckon this is likely the reason as my machine has only 64 CPU; I will check this.

Again, many thanks for the help.

result-1058-20240731-withtf28.log
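Since the question is whether the run stops because of the "W" line or because of something logged later, a small script can separate the two in a log file like the one attached. This is a minimal sketch: it assumes TensorFlow's native log lines look like the one in the title (a severity letter I/W/E/F, then a path starting with `tensorflow/`, then `:line]`), and that Python failures show up as standard tracebacks. All function names are hypothetical.

```python
import re
from collections import Counter

# Assumed shape: "<optional timestamp> W tensorflow/path/file.cc:690] message"
LOG_LINE = re.compile(
    r"(?:^|\s)(?P<level>[IWEF]) (?P<source>tensorflow/\S+):(?P<lineno>\d+)\] ?(?P<message>.*)"
)


def parse_line(line):
    """Return level, source file, line number, and message, or None if no match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return {
        "level": match.group("level"),
        "source": match.group("source"),
        "lineno": int(match.group("lineno")),
        "message": match.group("message"),
    }


def count_levels(text):
    """Count how many TensorFlow log lines there are per severity letter."""
    counts = Counter()
    for line in text.splitlines():
        parsed = parse_line(line)
        if parsed is not None:
            counts[parsed["level"]] += 1
    return counts


def traceback_line_numbers(text):
    """Return 1-based line numbers where a Python traceback starts."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith("Traceback (most recent call last)")
    ]
```

Reading the attached file into a string and passing it to `count_levels` and `traceback_line_numbers` would show whether any "E"/"F" lines or Python tracebacks appear, and where the first real failure sits relative to the warning.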

Aug 01 '24 02:08 longkunxuluke

Thank you for the additional information and uploading the log file. Since it only crashes 20 iterations into step 2 (and not immediately), it really might be related to some memory leak issue. Unfortunately, this seems to be a bug in tensorflow or a problem with one of the NVIDIA drivers or libraries and not with my code, so there is really not much I can do about that.
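One way to test the memory-leak hypothesis is to record the process's peak host (CPU) memory after every iteration and see whether it keeps climbing. Below is a minimal sketch using only the standard library. It is Unix-only (the `resource` module), the helper names are made up for this example, and `step` is a stand-in for one training iteration, not a function from StartProcess.py.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_peak_memory(step, n_iterations):
    """Call step(i) n_iterations times and record peak RSS after each call."""
    samples = []
    for i in range(n_iterations):
        step(i)
        samples.append(peak_rss_mib())
    return samples


def growth_per_iteration(samples):
    """Average increase in peak RSS (MiB) between consecutive samples."""
    if len(samples) < 2:
        return 0.0
    return (samples[-1] - samples[0]) / (len(samples) - 1)


if __name__ == "__main__":
    leaked = []

    def leaky_step(i):
        # Deliberately keeps a reference alive to imitate a leak.
        leaked.append(bytearray(5 * 1024 * 1024))

    samples = track_peak_memory(leaky_step, 10)
    print(f"peak RSS per iteration (MiB): {[round(s, 1) for s in samples]}")
    print(f"average growth: {growth_per_iteration(samples):.1f} MiB/iteration")
```

Because this tracks the peak, it never decreases; a flat curve after the first few iterations suggests no leak on the host side, while steady growth up to the crash would support the leak theory. It says nothing about GPU memory.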

The only other thing that comes to mind (besides the things mentioned above) might be to try a smaller batch size for the WGAN. Currently, it uses 64; maybe try reducing it to 32 or 16 and see if that solves the issue (change the number on line 28 in StartProcess.py)?

Sorry I can't be of more help with this issue.

Aug 01 '24 09:08 bruehle

Thanks for the help! Just to update on my recent attempts (this will probably be useful for future users): I tried decreasing a number of hyper-parameters, including wgan.batch_size as you suggested as well as others (wgan.n_z = 2, img_width), and none of these seems to solve the issue. I will try using other tf/NVIDIA driver versions.

Aug 05 '24 06:08 longkunxuluke

Hi, in case you are still interested (or for anybody else coming here with the same issue): I just released a new version of the scripts (v 1.2.0), which use Keras v3. The big advantage is that you can now very easily change the backend from tensorflow to pytorch if tensorflow gives you trouble (it also runs with tensorflow 2.17.0 now, maybe that fixes things as well?).
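For reference, Keras 3 in general picks its backend from the `KERAS_BACKEND` environment variable, which has to be set before `keras` is imported for the first time. The sketch below shows that generic mechanism only. Whether the v 1.2.0 scripts expose their own setting for this is not something this example assumes, and `select_keras_backend` is a hypothetical helper name.

```python
import os
import sys
import warnings

# Backend names accepted by Keras 3 for training, as far as I am aware.
SUPPORTED_BACKENDS = ("tensorflow", "torch", "jax")


def select_keras_backend(name):
    """Set KERAS_BACKEND; this must run before the first `import keras`."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    if "keras" in sys.modules:
        # Keras reads the variable at import time, so a later change has no effect.
        warnings.warn("keras is already imported; the backend will not change.")
    os.environ["KERAS_BACKEND"] = name
    return name


if __name__ == "__main__":
    select_keras_backend("torch")
    # import keras  # would now use PyTorch, provided keras and torch are installed
    print(os.environ["KERAS_BACKEND"])
```

The same effect can be had from the shell by exporting `KERAS_BACKEND=torch` before launching the script, which avoids touching the code at all.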

Sep 07 '24 13:09 bruehle