FastStyle

Where is output from training?

Open moldach opened this issue 4 years ago • 7 comments

I've run the following training script. There don't seem to be any obvious errors in the logs, so I think it ran successfully. I'm just having trouble finding the output to use for evaluation now:

Script:

#!/bin/bash
#SBATCH --job-name=train-pytorch
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --mem=8000
#SBATCH --gres=gpu:p100:2
#SBATCH --cpus-per-task=6
#SBATCH --output=%x_%j.log
#SBATCH --error=%x_%j.err

source tensorflow/bin/activate

python main.py train \
  --style /scratch/moldach/PyTorch-Style-Transfer/experiments/images/matts-styles/birmingham.jpg \
  --dataset datasets/train2014 \
  --weights imagenet-vgg-verydeep-19.mat

I get the following logs:

.err

2021-03-29 21:55:54.026157: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:24.188858: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-03-29 22:03:24.939154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:03:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:24.947032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties: 
pciBusID: 0000:04:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:24.963393: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:25.011845: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-03-29 22:03:25.036597: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-03-29 22:03:25.051713: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-03-29 22:03:25.071691: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-03-29 22:03:25.076390: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-03-29 22:03:25.147072: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-29 22:03:25.206992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1
2021-03-29 22:03:25.438640: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-29 22:03:28.342003: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2200150000 Hz
2021-03-29 22:03:28.588715: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x59faef0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-29 22:03:28.588828: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-03-29 22:03:29.246878: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a880d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-29 22:03:29.246987: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P100-PCIE-12GB, Compute Capability 6.0
2021-03-29 22:03:29.247029: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Tesla P100-PCIE-12GB, Compute Capability 6.0
2021-03-29 22:03:29.382411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:03:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:29.384001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties: 
pciBusID: 0000:04:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:29.384077: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:29.384136: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-03-29 22:03:29.384225: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-03-29 22:03:29.384278: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-03-29 22:03:29.384323: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-03-29 22:03:29.384367: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-03-29 22:03:29.384412: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-29 22:03:29.390742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1
2021-03-29 22:03:29.390834: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:33.204879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-29 22:03:33.204990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 1 
2021-03-29 22:03:33.205025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N Y 
2021-03-29 22:03:33.205043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 1:   Y N 
2021-03-29 22:03:33.361079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11121 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:03:00.0, compute capability: 6.0)
2021-03-29 22:03:33.424735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11121 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-12GB, pci bus id: 0000:04:00.0, compute capability: 6.0)
2021-03-29 22:03:47.740068: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-29 22:03:48.882829: W tensorflow/stream_executor/gpu/asm_compiler.cc:81] Running ptxas --version returned 256
2021-03-29 22:03:49.009004: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output: 
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2021-03-29 22:03:50.058719: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-03-29 22:17:30.953810: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 158 of 1024
2021-03-29 22:17:41.277361: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 333 of 1024
2021-03-29 22:17:51.193145: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 528 of 1024
2021-03-29 22:18:01.484417: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 661 of 1024
2021-03-29 22:18:11.074078: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 859 of 1024
2021-03-29 22:18:20.985531: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.
2021-03-29 23:59:09.658659: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 251 of 1024
2021-03-29 23:59:19.551394: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 563 of 1024
2021-03-29 23:59:29.764089: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 806 of 1024
2021-03-29 23:59:39.467575: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 1011 of 1024
2021-03-29 23:59:40.309649: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.

.log

Epoch 0
=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

Epoch 1
=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

Total time: 12052.2
=====================================
             All saved!              
=====================================

moldach avatar Mar 30 '21 14:03 moldach
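
One way to locate whatever a finished training run wrote is to list the files modified since the job started. The sketch below is not part of FastStyle: `recent_files` is a hypothetical helper, and it assumes the output landed somewhere under the directory the job was launched from. Files whose names start with a dot are included, since those are easy to miss with a plain ls.

```python
import os
import time


def recent_files(root, since_epoch):
    """Return (mtime, path) pairs for files under root modified at or after since_epoch, newest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if mtime >= since_epoch:
                found.append((mtime, path))
    return sorted(found, reverse=True)


if __name__ == "__main__":
    # Assumption: the job ran within the last 24 hours, launched from the current directory.
    cutoff = time.time() - 24 * 3600
    for mtime, path in recent_files(".", cutoff):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(stamp, path)
```

Run from the directory where sbatch was invoked, this prints the newest files first, so anything saved at the end of training should appear near the top.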

Hi @moldach, you can use the following command to stylize a content image:

python main.py evaluate --content ./path/to/content/image.jpg   \
                        --weights ./path/to/weights \
                        --result ./path/to/save/results/image.jpg

cryu854 avatar Mar 30 '21 15:03 cryu854

From the example you provided, it's not clear how I'm transferring my style to a content image. I can see the --content param, where I can provide either a .mp4 or a .jpg, but it's not clear where I should put the style.

In the README it says:

Models for evaluation are located here https://drive.google.com/drive/folders/1-ywa__KcK4uEEYOzgfeRCpCzP3RJKBwL Example usage:

python main.py evaluate    \
  --weights ./path/to/weights \
  --content ./path/to/content/image.jpg(video.mp4) \
  --result ./path/to/save/results/image.jpg
  

It's not clear to me if you are using weights & models interchangeably? What am I supposed to put in the --weights param here?

Just to be 100% clear, I do not want to use your models (e.g. wave). I would like to train a new model based on an image of my choice.

That's why I thought you need to "Use main.py to train a new style transfer network" first?

python main.py train    \
  --style ./path/to/style/image.jpg \
  --dataset ./path/to/dataset \
  --weights ./path/to/weights \
  --batch 2    
  

Is the --weights here actually the model output? Should I not be providing the pre-trained checkpoint imagenet-vgg-verydeep-19.mat there?

Your help is much appreciated :)

moldach avatar Mar 30 '21 17:03 moldach

Once you have trained the model based on your style image, you will get a new model named imagenet-vgg-verydeep-19.mat for your case. If you want to stylize the content image, you just need to put the path of that model in the --weights parameter.

cryu854 avatar Mar 30 '21 17:03 cryu854

I have the same question: I want to train a custom model and I don't see any output.

In your models available for download I see the checkpoints, etc., but when I train I don't get any. You say you will get a new .mat file...

python3 ./FastStyle/main.py train --style ./source/art.jpg --dataset ./dataset/train2014 --weights ./weights/ --batch 2

But I get no .mat output, no checkpoints, and no errors, just messages on screen and one at the end that says All saved!

What am I doing wrong?

Thanks

EDIT: I see a 56 KB checkpoint file in the folder.

natxopedreira avatar Jul 08 '21 07:07 natxopedreira

I think I found the issue: I read somewhere in the docs that the path is also used as the name for the checkpoints! So if you give a path, it cannot be used as a name... and I don't understand how you can provide both a path and a name.

Anyway, just passing --weights points (or whatever name) will create the checkpoints correctly.

natxopedreira avatar Jul 08 '21 10:07 natxopedreira

python main.py train \
  --style /scratch/moldach/PyTorch-Style-Transfer/experiments/images/matts-styles/birmingham.jpg \
  --dataset datasets/train2014 \
  --weights imagenet-vgg-verydeep-19.mat

But that imagenet-vgg-verydeep-19.mat should not be there, or anywhere in this project. --weights should be an empty folder where the training progress is saved so it can be used later. I don't see anywhere that MATLAB is needed in this project, so that .mat file is useless...

KlausStortebeker avatar Sep 17 '21 15:09 KlausStortebeker

I think I found the issue: I read somewhere in the docs that the path is also used as the name for the checkpoints! So if you give a path, it cannot be used as a name... and I don't understand how you can provide both a path and a name.

Anyway, just passing --weights points (or whatever name) will create the checkpoints correctly.

--weights is used to save checkpoints during training and to read them during evaluation. So for training it should point to an empty folder, and for evaluation it should point to a folder that has the "pretrained models" content in it.

If you don't want to save into something like --weights ./weights/1, it should be --weights ./weights, without the trailing slash.

KlausStortebeker avatar Sep 17 '21 16:09 KlausStortebeker