Where is output from training?
I ran the following training script. There don't seem to be any obvious errors in the logs, so I think it ran successfully; I'm just having trouble finding the output to use for evaluation:
Script:
#!/bin/bash
#SBATCH --job-name=train-pytorch
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --mem=8000
#SBATCH --gres=gpu:p100:2
#SBATCH --cpus-per-task=6
#SBATCH --output=%x_%j.log
#SBATCH --error=%x_%j.err
source tensorflow/bin/activate
python main.py train \
--style /scratch/moldach/PyTorch-Style-Transfer/experiments/images/matts-styles/birmingham.jpg \
--dataset datasets/train2014 \
--weights imagenet-vgg-verydeep-19.mat
I get the following logs:
.err
2021-03-29 21:55:54.026157: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:24.188858: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-03-29 22:03:24.939154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:03:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:24.947032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:04:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:24.963393: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:25.011845: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-03-29 22:03:25.036597: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-03-29 22:03:25.051713: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-03-29 22:03:25.071691: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-03-29 22:03:25.076390: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-03-29 22:03:25.147072: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-29 22:03:25.206992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1
2021-03-29 22:03:25.438640: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-29 22:03:28.342003: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2200150000 Hz
2021-03-29 22:03:28.588715: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x59faef0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-29 22:03:28.588828: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-29 22:03:29.246878: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a880d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-29 22:03:29.246987: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla P100-PCIE-12GB, Compute Capability 6.0
2021-03-29 22:03:29.247029: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla P100-PCIE-12GB, Compute Capability 6.0
2021-03-29 22:03:29.382411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:03:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:29.384001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties:
pciBusID: 0000:04:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:29.384077: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:29.384136: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-03-29 22:03:29.384225: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-03-29 22:03:29.384278: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-03-29 22:03:29.384323: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-03-29 22:03:29.384367: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-03-29 22:03:29.384412: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-29 22:03:29.390742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1
2021-03-29 22:03:29.390834: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:33.204879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-29 22:03:33.204990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0 1
2021-03-29 22:03:33.205025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N Y
2021-03-29 22:03:33.205043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 1: Y N
2021-03-29 22:03:33.361079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11121 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:03:00.0, compute capability: 6.0)
2021-03-29 22:03:33.424735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11121 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-12GB, pci bus id: 0000:04:00.0, compute capability: 6.0)
2021-03-29 22:03:47.740068: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-29 22:03:48.882829: W tensorflow/stream_executor/gpu/asm_compiler.cc:81] Running ptxas --version returned 256
2021-03-29 22:03:49.009004: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output:
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2021-03-29 22:03:50.058719: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-03-29 22:17:30.953810: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 158 of 1024
2021-03-29 22:17:41.277361: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 333 of 1024
2021-03-29 22:17:51.193145: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 528 of 1024
2021-03-29 22:18:01.484417: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 661 of 1024
2021-03-29 22:18:11.074078: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 859 of 1024
2021-03-29 22:18:20.985531: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.
2021-03-29 23:59:09.658659: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 251 of 1024
2021-03-29 23:59:19.551394: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 563 of 1024
2021-03-29 23:59:29.764089: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 806 of 1024
2021-03-29 23:59:39.467575: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 1011 of 1024
2021-03-29 23:59:40.309649: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.
.log
Epoch 0
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
Epoch 1
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
=====================================
Weights saved!
=====================================
Total time: 12052.2
=====================================
All saved!
=====================================
Hi @moldach, you can use the following command to stylize a content image:
python main.py evaluate --content ./path/to/content/image.jpg \
--weights ./path/to/weights \
--result ./path/to/save/results/image.jpg
From the example you provided, it's not clear how I'm transferring my style to a content image.
I can see the --content param, where I can provide either a .mp4 or a .jpg, but it's not clear where I should put the style.
In the README it says:
Models for evaluation are located here https://drive.google.com/drive/folders/1-ywa__KcK4uEEYOzgfeRCpCzP3RJKBwL Example usage:
python main.py evaluate \
--weights ./path/to/weights \
--content ./path/to/content/image.jpg(video.mp4) \
--result ./path/to/save/results/image.jpg
It's not clear to me whether you are using "weights" and "models" interchangeably.
What am I supposed to put in the --weights param here?
Just to be 100% clear, I do not want to use your models (e.g. wave); I would like to train a new model based on an image of my choice.
That's why I thought you need to use main.py to train a new style transfer network first?
python main.py train \
--style ./path/to/style/image.jpg \
--dataset ./path/to/dataset \
--weights ./path/to/weights \
--batch 2
Is the --weights here actually the model output?
Should I not be providing the pre-trained checkpoint imagenet-vgg-verydeep-19.mat there?
Your help is much appreciated :)
Once you have trained the model on your style image, you will get a new model, named imagenet-vgg-verydeep-19.mat in your case.
To stylize a content image, you just need to pass the path of that model in the --weights parameter.
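To make the relationship between the two commands concrete, here is a minimal sketch that builds both command lines around one shared weights path. The subcommands and flags are the ones quoted in this thread; the helper names and the example paths are hypothetical, and whether --weights should name a file or a folder is exactly what this thread is trying to settle.

```python
import shlex

def train_command(style, dataset, weights, batch=2):
    """Build the `train` call; per the replies here, `weights` receives the output."""
    return ["python", "main.py", "train",
            "--style", style,
            "--dataset", dataset,
            "--weights", weights,
            "--batch", str(batch)]

def evaluate_command(content, weights, result):
    """Build the `evaluate` call; `weights` should match the training run."""
    return ["python", "main.py", "evaluate",
            "--weights", weights,
            "--content", content,
            "--result", result]

# Hypothetical paths, for illustration only.
weights = "./weights/my-style"
train = train_command("./style.jpg", "./datasets/train2014", weights)
evaluate = evaluate_command("./content.jpg", weights, "./result.jpg")

# Print shell-quoted versions that could be pasted into a job script.
print(shlex.join(train))
print(shlex.join(evaluate))
```

The only point of the sketch is that the same value is passed to --weights in both steps, so whatever training writes is what evaluation reads.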
I have the same question: I want to train a custom model and I don't see any output.
In your models available for download I can see the checkpoints, etc., but not when I train. You say you will get a new .mat file...
python3 ./FastStyle/main.py train --style ./source/art.jpg --dataset ./dataset/train2014 --weights ./weights/ --batch 2
but I get no .mat output, no checkpoints, and no errors, just messages on screen and one at the end that says All saved!
What am I doing wrong?
Thanks
EDIT: I see a 56 KB checkpoint file in the folder.
I think I found the issue: I read somewhere in the docs that the path is also used as the name for the checkpoints! So if you pass a path, it cannot be used as a name... and I don't understand how you can specify both a path and a name.
Anyway, just passing --weights points (or any other name) creates the checkpoints correctly.
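If the value of --weights is treated as a checkpoint prefix, the saved files can be located by that prefix. The sketch below assumes the project writes TensorFlow-format checkpoints (a `checkpoint` bookkeeping file plus `<prefix>.index` and `<prefix>.data-*` files), which this thread does not confirm; the function name and the demo file names are hypothetical.

```python
import os
import tempfile

def find_checkpoint_files(weights_arg):
    """List files a TF-style checkpoint saved under `weights_arg` could have left."""
    directory, prefix = os.path.split(weights_arg)
    directory = directory or "."
    if not os.path.isdir(directory):
        return []
    matches = []
    # os.listdir also returns dot-files, which matters when the prefix is empty.
    for name in sorted(os.listdir(directory)):
        if (name == "checkpoint"
                or name.startswith(prefix + ".index")
                or name.startswith(prefix + ".data-")):
            matches.append(os.path.join(directory, name))
    return matches

# Demo in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint", "points.index",
                 "points.data-00000-of-00001", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p)
             for p in find_checkpoint_files(os.path.join(tmp, "points"))]

print(found)
```

Under the same assumption, a value ending in a slash leaves an empty prefix, so the matching files would be dot-files such as `.index`, which a plain `ls` does not show.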
python main.py train \
--style /scratch/moldach/PyTorch-Style-Transfer/experiments/images/matts-styles/birmingham.jpg \
--dataset datasets/train2014 \
--weights imagenet-vgg-verydeep-19.mat
But that imagenet-vgg-verydeep-19.mat should not be in there, or anywhere in this project. --weights should point to an empty folder where the training progress is saved so it can be used later. I don't see anywhere that MATLAB is needed in this project, so that .mat file is useless...
I think I found the issue: I read somewhere in the docs that the path is also used as the name for the checkpoints! So if you pass a path, it cannot be used as a name... and I don't understand how you can specify both a path and a name.
Anyway, just passing --weights points (or any other name) creates the checkpoints correctly.
In training, --weights is used to save checkpoints; in evaluate, it is used to read them. So for training it should be an empty folder, and for evaluate it should contain the "pretrained models" content.
If you don't want to save into a subfolder such as --weights ./weights/1, use --weights ./weights, without a trailing slash.
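The effect of the trailing slash can be checked with the standard library alone. This is a small sketch of how `os.path.split` divides the three values mentioned in this thread; whether main.py splits its --weights argument this way is an assumption, and the helper name is hypothetical.

```python
import os

def split_weights_arg(weights_arg):
    """Return (directory, name) the way os.path.split sees the argument."""
    directory, name = os.path.split(weights_arg)
    return directory or ".", name

with_slash = split_weights_arg("./weights/")    # name part is empty
without_slash = split_weights_arg("./weights")  # name part is "weights"
nested = split_weights_arg("./weights/1")       # name part is "1"

for label, parts in (("./weights/", with_slash),
                     ("./weights", without_slash),
                     ("./weights/1", nested)):
    print(label, "->", parts)
```

With a trailing slash the name part is empty, which matches the earlier observation that a bare directory path leaves nothing to use as a checkpoint name.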