Errors running PyTorch GPU example on 1.8-beta3
Update: see below for the error on the latest 1.8-beta3
During Installation
When trying to install the PyTorch dependencies with GPU support, pip kept installing a different PyTorch version in the second command (the one
that installs functorch). This was fixed by installing both in one command: run(`$(PyCall.pyprogramname) -m pip install torch==1.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html functorch`).
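For reference, the same fix as a runnable snippet from the Julia REPL (just the command above with the PyCall import spelled out):

```julia
using PyCall

# Install torch and functorch in a single pip invocation so pip resolves one
# consistent torch build (1.11.0+cu113 here) for both packages.
run(`$(PyCall.pyprogramname) -m pip install torch==1.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html functorch`)
```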
isfunctional
At least on my version of CUDA.jl, isfunctional does not exist; only CUDA.functional() works.
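For reference, a minimal sketch of the check that does work here:

```julia
using CUDA

# CUDA.functional() reports whether a usable GPU, driver, and artifacts are available;
# isfunctional is not defined on this CUDA.jl version.
if CUDA.functional()
    @info "CUDA is functional" device = CUDA.name(CUDA.device())
else
    @warn "CUDA is not functional; the GPU example will not run"
end
```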
Error taking gradients when running the last line, i.e. grad = Zygote.gradient(m->loss(m, input, target), jlwrap)
This one I haven't been able to fix yet. It would be helpful to know whether it only happens on my machine.
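For context, this is roughly the setup the failing line comes from (a sketch reconstructed from the stacktrace below, not the exact README code; the wrapped module, the shapes, and the loss definition are my assumptions):

```julia
using PyCall, CUDA, Zygote
using PyCallChainRules.Torch: TorchModuleWrapper

torch = pyimport("torch")

# Wrap a small PyTorch module living on the GPU so Zygote can differentiate through it.
jlwrap = TorchModuleWrapper(torch.nn.Linear(4, 2).to(device="cuda"))

# Illustrative shapes only.
input  = CUDA.rand(Float32, 4, 8)
target = CUDA.rand(Float32, 2, 8)

# A simple squared-error loss; the broadcasted `-` matches what shows up in the
# stacktrace, but the exact loss used in the example is an assumption.
loss(m, x, y) = sum(abs2, m(x) .- y)

grad = Zygote.gradient(m -> loss(m, input, target), jlwrap)
```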
Full stacktrace
julia> grad = Zygote.gradient(m->loss(m, input, target), jlwrap)
WARNING: Error while freeing DeviceBuffer(4 bytes at 0x0000000302005000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
Stacktrace:
[1] throw_api_error(res::CUDA.cudaError_enum)
@ CUDA ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/error.jl:91
[2] macro expansion
@ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/error.jl:101 [inlined]
[3] cuMemFreeAsync(dptr::CUDA.Mem.DeviceBuffer, hStream::CuStream)
@ CUDA ~/.julia/packages/CUDA/5jdFl/lib/utils/call.jl:26
[4] #free#2
@ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/memory.jl:97 [inlined]
[5] macro expansion
@ ~/.julia/packages/CUDA/5jdFl/src/pool.jl:58 [inlined]
[6] macro expansion
@ ./timing.jl:359 [inlined]
[7] #actual_free#189
@ ~/.julia/packages/CUDA/5jdFl/src/pool.jl:57 [inlined]
[8] #_free#207
@ ~/.julia/packages/CUDA/5jdFl/src/pool.jl:375 [inlined]
[9] macro expansion
@ ~/.julia/packages/CUDA/5jdFl/src/pool.jl:340 [inlined]
[10] macro expansion
@ ./timing.jl:359 [inlined]
[11] #free#206
@ ~/.julia/packages/CUDA/5jdFl/src/pool.jl:339 [inlined]
[12] #212
@ ~/.julia/packages/CUDA/5jdFl/src/array.jl:79 [inlined]
[13] context!(f::CUDA.var"#212#213"{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
@ CUDA ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/state.jl:164
[14] unsafe_free!(xs::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, stream::CuStream)
@ CUDA ~/.julia/packages/CUDA/5jdFl/src/array.jl:78
[15] unsafe_finalize!(xs::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
@ CUDA ~/.julia/packages/CUDA/5jdFl/src/array.jl:99
[16] synchronize_cuda_tasks(ex::Expr)
@ CUDA ~/.julia/packages/CUDA/5jdFl/src/initialization.jl:50
[17] #invokelatest#2
@ ./essentials.jl:729 [inlined]
[18] invokelatest
@ ./essentials.jl:727 [inlined]
[19] eval_user_input(ast::Any, backend::REPL.REPLBackend)
@ REPL ~/.julia/juliaup/julia-1.8.0-beta1+0~x64/share/julia/stdlib/v1.8/REPL/src/REPL.jl:149
[20] repl_backend_loop(backend::REPL.REPLBackend)
@ REPL ~/.julia/juliaup/julia-1.8.0-beta1+0~x64/share/julia/stdlib/v1.8/REPL/src/REPL.jl:247
[21] start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
@ REPL ~/.julia/juliaup/julia-1.8.0-beta1+0~x64/share/julia/stdlib/v1.8/REPL/src/REPL.jl:232
[22] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool)
@ REPL ~/.julia/juliaup/julia-1.8.0-beta1+0~x64/share/julia/stdlib/v1.8/REPL/src/REPL.jl:369
[23] run_repl(repl::REPL.AbstractREPL, consumer::Any)
@ REPL ~/.julia/juliaup/julia-1.8.0-beta1+0~x64/share/julia/stdlib/v1.8/REPL/src/REPL.jl:356
[24] (::Base.var"#960#962"{Bool, Bool, Bool})(REPL::Module)
@ Base ./client.jl:419
[25] #invokelatest#2
@ ./essentials.jl:729 [inlined]
[26] invokelatest
@ ./essentials.jl:727 [inlined]
[27] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
@ Base ./client.jl:404
[28] exec_options(opts::Base.JLOptions)
@ Base ./client.jl:318
[29] _start()
@ Base ./client.jl:522
WARNING: Error while freeing DeviceBuffer(4.000 KiB at 0x0000000302004000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
Stacktrace:
[1] throw_api_error(res::CUDA.cudaError_enum)
@ CUDA ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/error.jl:91
[2] macro expansion
@ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/error.jl:101 [inlined]
[3] cuMemFreeAsync(dptr::CUDA.Mem.DeviceBuffer, hStream::CuStream)
@ CUDA ~/.julia/packages/CUDA/5jdFl/lib/utils/call.jl:26
[4] #free#2
@ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/memory.jl:97 [inlined]
[5] macro expansion
@ ~/.julia/packages/CUDA/5jdFl/src/pool.jl:58 [inlined]
[6] macro expansion
@ ./timing.jl:359 [inlined]
[7] #actual_free#189
@ ~/.julia/packages/CUDA/5jdFl/src/pool.jl:57 [inlined]
[8] #_free#207
@ ~/.julia/packages/CUDA/5jdFl/src/pool.jl:375 [inlined]
[9] macro expansion
@ ~/.julia/packages/CUDA/5jdFl/src/pool.jl:340 [inlined]
[10] macro expansion
@ ./timing.jl:359 [inlined]
[11] #free#206
@ ~/.julia/packages/CUDA/5jdFl/src/pool.jl:339 [inlined]
[12] #212
@ ~/.julia/packages/CUDA/5jdFl/src/array.jl:79 [inlined]
[13] context!(f::CUDA.var"#212#213"{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuStream}, ctx::CuContext; skip_destroyed::Bool)
@ CUDA ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/state.jl:164
[14] unsafe_free!(xs::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, stream::CuStream)
@ CUDA ~/.julia/packages/CUDA/5jdFl/src/array.jl:78
[15] unsafe_finalize!(xs::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
@ CUDA ~/.julia/packages/CUDA/5jdFl/src/array.jl:99
[16] synchronize_cuda_tasks(ex::Expr)
@ CUDA ~/.julia/packages/CUDA/5jdFl/src/initialization.jl:50
[17] #invokelatest#2
@ ./essentials.jl:729 [inlined]
[18] invokelatest
@ ./essentials.jl:727 [inlined]
[19] eval_user_input(ast::Any, backend::REPL.REPLBackend)
@ REPL ~/.julia/juliaup/julia-1.8.0-beta1+0~x64/share/julia/stdlib/v1.8/REPL/src/REPL.jl:149
[20] repl_backend_loop(backend::REPL.REPLBackend)
@ REPL ~/.julia/juliaup/julia-1.8.0-beta1+0~x64/share/julia/stdlib/v1.8/REPL/src/REPL.jl:247
[21] start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
@ REPL ~/.julia/juliaup/julia-1.8.0-beta1+0~x64/share/julia/stdlib/v1.8/REPL/src/REPL.jl:232
[22] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool)
@ REPL ~/.julia/juliaup/julia-1.8.0-beta1+0~x64/share/julia/stdlib/v1.8/REPL/src/REPL.jl:369
[23] run_repl(repl::REPL.AbstractREPL, consumer::Any)
@ REPL ~/.julia/juliaup/julia-1.8.0-beta1+0~x64/share/julia/stdlib/v1.8/REPL/src/REPL.jl:356
[24] (::Base.var"#960#962"{Bool, Bool, Bool})(REPL::Module)
@ Base ./client.jl:419
[25] #invokelatest#2
@ ./essentials.jl:729 [inlined]
[26] invokelatest
@ ./essentials.jl:727 [inlined]
[27] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
@ Base ./client.jl:404
[28] exec_options(opts::Base.JLOptions)
@ Base ./client.jl:318
[29] _start()
@ Base ./client.jl:522
ERROR: PyError ($(Expr(:escape, :(ccall(#= /home/lorenz/.julia/packages/PyCall/7a7w0/src/pyfncall.jl:43 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'RuntimeError'>
RuntimeError('CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
File "/home/lorenz/anaconda3/envs/pycall/lib/python3.8/site-packages/functorch/_src/eager_transforms.py", line 243, in vjp
primals_out = func(*diff_primals)
File "/home/lorenz/.julia/packages/PyCall/7a7w0/src/pyeval.jl", line 3, in newfn
const Py_eval_input = 258
File "/home/lorenz/anaconda3/envs/pycall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lorenz/anaconda3/envs/pycall/lib/python3.8/site-packages/functorch/_src/make_functional.py", line 259, in forward
return self.stateless_model(*args, **kwargs)
File "/home/lorenz/anaconda3/envs/pycall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lorenz/anaconda3/envs/pycall/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
Stacktrace:
[1] pyerr_check
@ ~/.julia/packages/PyCall/7a7w0/src/exception.jl:62 [inlined]
[2] pyerr_check
@ ~/.julia/packages/PyCall/7a7w0/src/exception.jl:66 [inlined]
[3] _handle_error(msg::String)
@ PyCall ~/.julia/packages/PyCall/7a7w0/src/exception.jl:83
[4] macro expansion
@ ~/.julia/packages/PyCall/7a7w0/src/exception.jl:97 [inlined]
[5] #107
@ ~/.julia/packages/PyCall/7a7w0/src/pyfncall.jl:43 [inlined]
[6] disable_sigint
@ ./c.jl:473 [inlined]
[7] __pycall!
@ ~/.julia/packages/PyCall/7a7w0/src/pyfncall.jl:42 [inlined]
[8] _pycall!(ret::PyObject, o::PyObject, args::Tuple{PyObject, Tuple{PyObject, PyObject}, PyObject}, nargs::Int64, kw::Ptr{Nothing})
@ PyCall ~/.julia/packages/PyCall/7a7w0/src/pyfncall.jl:29
[9] _pycall!(ret::PyObject, o::PyObject, args::Tuple{PyObject, Tuple{PyObject, PyObject}, PyObject}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ PyCall ~/.julia/packages/PyCall/7a7w0/src/pyfncall.jl:11
[10] (::PyObject)(::PyObject, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ PyCall ~/.julia/packages/PyCall/7a7w0/src/pyfncall.jl:86
[11] (::PyObject)(::PyObject, ::Vararg{Any})
@ PyCall ~/.julia/packages/PyCall/7a7w0/src/pyfncall.jl:86
[12] rrule(wrap::TorchModuleWrapper, args::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ PyCallChainRules.Torch ~/.julia/packages/PyCallChainRules/Vrwrg/src/pytorch.jl:65
[13] rrule
@ ~/.julia/packages/PyCallChainRules/Vrwrg/src/pytorch.jl:60 [inlined]
[14] rrule
@ ~/.julia/packages/ChainRulesCore/IzITE/src/rules.jl:134 [inlined]
[15] chain_rrule
@ ~/.julia/packages/Zygote/H6vD3/src/compiler/chainrules.jl:216 [inlined]
[16] macro expansion
@ ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0 [inlined]
[17] _pullback(ctx::Zygote.Context, f::TorchModuleWrapper, args::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
@ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:9
[18] _pullback
@ ./REPL[31]:1 [inlined]
[19] _pullback(::Zygote.Context, ::typeof(loss), ::TorchModuleWrapper, ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
@ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
[20] _pullback
@ ./REPL[35]:1 [inlined]
[21] _pullback(ctx::Zygote.Context, f::var"#11#12", args::TorchModuleWrapper)
@ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
[22] _pullback(f::Function, args::TorchModuleWrapper)
@ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:34
[23] pullback(f::Function, args::TorchModuleWrapper)
@ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:40
[24] gradient(f::Function, args::TorchModuleWrapper)
@ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:75
[25] top-level scope
@ REPL[35]:1
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
[1] throw_api_error(res::CUDA.cudaError_enum)
@ CUDA ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/error.jl:91
[2] isdone
@ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/stream.jl:109 [inlined]
[3] nonblocking_synchronize
@ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/stream.jl:139 [inlined]
[4] nonblocking_synchronize
@ ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/context.jl:325 [inlined]
[5] device_synchronize()
@ CUDA ~/.julia/packages/CUDA/5jdFl/lib/cudadrv/context.jl:319
[6] top-level scope
@ ~/.julia/packages/CUDA/5jdFl/src/initialization.jl:54
Output of CUDA.versioninfo()
CUDA toolkit 11.6, artifact installation
NVIDIA driver 470.103.1, for CUDA 11.4
CUDA driver 11.4
Libraries:
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+470.103.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)
Toolchain:
- Julia: 1.8.0-beta1
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
1 device:
0: NVIDIA GeForce GTX 1080 Ti (sm_61, 10.067 GiB / 10.915 GiB available)
Hi @lorenzoh! Thanks for checking in! Would it be possible for you to test on v1.7/v1.6 instead? I think there are still quite a few GPUCompiler.jl fixes that haven't landed in the 1.8-beta1 release, and those do affect CUDA. I am not even sure if PyCall.jl works correctly on v1.8 either. I'm sure we will resolve these issues over the v1.8 release window, but I just wanted to confirm that at least Julia v1.6/v1.7 work.
Fixed the typo in the README in 8370fbb0a84f74b13f0134070217f0dfba08ad2c. I should figure out how to run these as tests and probably move to something like Documenter for proper doctests.
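One possible route for the doctests, sketched under the assumption that the README examples get turned into jldoctest blocks (the GPU snippets would still need a GPU runner):

```julia
# test/runtests.jl (sketch)
using Documenter, PyCallChainRules, Test

# Run every jldoctest block Documenter can find for the package.
Documenter.DocMeta.setdocmeta!(PyCallChainRules, :DocTestSetup,
                               :(using PyCallChainRules); recursive=true)
doctest(PyCallChainRules)
```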
Ah, I see, will try on 1.7 for sure! :+1:
In any case, I am super excited about this, since this seems like a solid approach for a PyTorch backend to Flux.jl packages like FastAI.jl :)
Also, yeah, having these as tests is a good idea, but (at least on CI) the GPU examples won't work.
I have tried with Julia 1.7.2, and the CUDA example works just fine :rainbow: In case you want to track progress on 1.8 compatibility I'll keep this open, but feel free to close this issue.
Great! I'll keep this open to track Julia v1.8 issues, but looking forward to the PyTorch backend updates!
Short update: I tried on 1.8.0-beta3 and am getting a different error. I am not really sure where to start debugging this, but I'm putting it here for completeness.
Stacktrace
julia> grad, = Zygote.gradient(m->loss(m, input, target), jlwrap)
ERROR: MethodError: no method matching return_types(::GPUArrays.var"#broadcast_kernel#17", ::Type{Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(-), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}, ::GPUCompiler.GPUInterpreter)
Closest candidates are:
return_types(::Any, ::Any; world, interp) at reflection.jl:1294
return_types(::Any) at reflection.jl:1294
Stacktrace:
[1] check_method(job::GPUCompiler.CompilerJob)
@ GPUCompiler ~/.julia/packages/GPUCompiler/I9fZc/src/validation.jl:19
[2] macro expansion
@ ~/.julia/packages/TimerOutputs/nDhDw/src/TimerOutput.jl:252 [inlined]
[3] macro expansion
@ ~/.julia/packages/GPUCompiler/I9fZc/src/driver.jl:89 [inlined]
[4] emit_julia(job::GPUCompiler.CompilerJob)
@ GPUCompiler ~/.julia/packages/GPUCompiler/I9fZc/src/utils.jl:64
[5] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/packages/CUDA/5jdFl/src/compiler/execution.jl:324
[6] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/I9fZc/src/cache.jl:90
[7] cufunction(f::GPUArrays.var"#broadcast_kernel#17", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(-), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~/.julia/packages/CUDA/5jdFl/src/compiler/execution.jl:297
[8] cufunction
@ ~/.julia/packages/CUDA/5jdFl/src/compiler/execution.jl:291 [inlined]
[9] macro expansion
@ ~/.julia/packages/CUDA/5jdFl/src/compiler/execution.jl:102 [inlined]
[10] #launch_heuristic#282
@ ~/.julia/packages/CUDA/5jdFl/src/gpuarrays.jl:17 [inlined]
[11] _copyto!
@ ~/.julia/packages/GPUArrays/VNhDf/src/host/broadcast.jl:73 [inlined]
[12] copyto!
@ ~/.julia/packages/GPUArrays/VNhDf/src/host/broadcast.jl:56 [inlined]
[13] copy
@ ~/.julia/packages/GPUArrays/VNhDf/src/host/broadcast.jl:47 [inlined]
[14] materialize
@ ./broadcast.jl:860 [inlined]
[15] adjoint
@ ~/.julia/packages/Zygote/H6vD3/src/lib/broadcast.jl:77 [inlined]
[16] _pullback(__context__::Zygote.Context, 676::typeof(Base.Broadcast.broadcasted), 677::typeof(-), x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, y::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
@ Zygote ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:65
[17] _pullback
@ ./REPL[39]:1 [inlined]
[18] _pullback(::Zygote.Context, ::typeof(loss), ::TorchModuleWrapper, ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
@ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
[19] _pullback
@ ./REPL[40]:1 [inlined]
[20] _pullback(ctx::Zygote.Context, f::var"#3#4", args::TorchModuleWrapper)
@ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
[21] _pullback(f::Function, args::TorchModuleWrapper)
@ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:34
[22] pullback(f::Function, args::TorchModuleWrapper)
@ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:40
[23] gradient(f::Function, args::TorchModuleWrapper)
@ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:75
[24] top-level scope
@ REPL[40]:1
[25] top-level scope
@ ~/.julia/packages/CUDA/5jdFl/src/initialization.jl:52
All tests are passing for me locally. Hmm. What's your CUDA version?
julia> CUDA.versioninfo()
CUDA toolkit 11.6, artifact installation
NVIDIA driver 510.47.3, for CUDA 11.6
CUDA driver 11.6
Libraries:
- CUBLAS: 11.5.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+510.47.3
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)
Toolchain:
- Julia: 1.8.0-beta3
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
1 device:
0: Tesla V100-PCIE-16GB (sm_70, 14.382 GiB / 16.000 GiB available)
julia> using CUDA
julia> CUDA.versioninfo()
CUDA toolkit 11.6, artifact installation
NVIDIA driver 470.103.1, for CUDA 11.4
CUDA driver 11.4
Libraries:
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+470.103.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)
Toolchain:
- Julia: 1.8.0-beta3
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
1 device:
0: NVIDIA GeForce GTX 1080 Ti (sm_61, 10.234 GiB / 10.915 GiB available)
Seems I have CUDA driver 11.4 while you have 11.6. Do you know how to update this specifically? Then I could test again.
Just Pkg.update() should fix this? I know there have been a bunch of changes in GPUCompiler lately. Would it be possible to try this in a new project/environment if your current environment is not letting you upgrade?
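Something along these lines in a throwaway environment should pull the latest released CUDA.jl/GPUCompiler.jl (a sketch; the package list is just what appears in the stacktraces above):

```julia
using Pkg

# A temporary environment avoids any pins in the existing Manifest.
Pkg.activate(; temp=true)
Pkg.add(["CUDA", "Zygote", "PyCall", "PyCallChainRules"])
Pkg.status("CUDA")   # check which CUDA.jl version got resolved
```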
Will try to update and rerun tomorrow 👍 Did the loss happen to go down when running for longer?
Thanks for doing the testing!
Did the loss happen to go down when running for longer?
It didn't. But that's not surprising given #19? The computed gradients have a weird structure that doesn't work well with Flux's implicit parameters, but they should work with Optimisers.jl.
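For reference, the explicit-gradient update with Optimisers.jl would look roughly like this (a sketch, assuming the gradient returned for the wrapper has a structure Optimisers.jl can fold over):

```julia
using Optimisers, Zygote

# Plain gradient descent just to illustrate the API.
opt_state = Optimisers.setup(Optimisers.Descent(1f-2), jlwrap)

for _ in 1:100
    grad, = Zygote.gradient(m -> loss(m, input, target), jlwrap)
    # Optimisers.update returns the new optimiser state and the updated model.
    global opt_state, jlwrap = Optimisers.update(opt_state, jlwrap, grad)
end
```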
Ah, I see! Will try out with Optimisers.jl.