How to use more local parameters with a new CUDA device?
Description and motivation
@maljoras Thanks for answering all my questions. As mentioned in previous issues, I am currently trying to develop a new device model for aihwkit. The CPU version seems to be working now, and I am moving on to the CUDA version.
My device model requires 4 local (device-specific) variables to enable device-to-device (dtod) variations. In the CPU version this is rather straightforward: I simply declare more matrices to store them:
```cpp
private:
  T **device_specific_Ndiscmax = nullptr;
  T **device_specific_Ndiscmin = nullptr;
  T **device_specific_ldet = nullptr;
  T **device_specific_A = nullptr;
```
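For reference, a minimal sketch (plain C++, not aihwkit's actual allocation helpers) of how one of these matrices could be set up with the usual contiguous row-pointer layout; `d_size` and `x_size` are placeholders for the tile dimensions:

```cpp
// Illustrative only: allocate one device-specific matrix as a contiguous
// [d_size x x_size] block; the real code would follow the allocation pattern
// used for the existing per-device parameters in the device's populate step.
device_specific_Ndiscmax = new T *[d_size];
device_specific_Ndiscmax[0] = new T[(size_t)d_size * x_size];
for (int i = 1; i < d_size; ++i) {
  device_specific_Ndiscmax[i] = device_specific_Ndiscmax[0] + (size_t)i * x_size;
}
```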
However, in the CUDA version this is not possible, as the device update function only supports two device-specific parameters, wrapped in par_2.
Proposed solution
I tried to trace this through the code and found out that I need to override [this macro](https://github.com/IBM/aihwkit/blob/master/src/rpucuda/cuda/pwu_kernel.h#L236-L248) to cast params_2 differently:
```cpp
#define RPU_FUNCTOR_LOAD_PARAMS                            \
  {                                                         \
    w = weights[idx];                                       \
    if (params != nullptr) {                                \
      par_4 = reinterpret_cast<float4 *>(params)[idx];      \
    }                                                       \
    if (params_2 != nullptr) {                              \
      par_2 = reinterpret_cast<float2 *>(params_2)[idx];    \
    }                                                       \
    if (use_par_1) {                                        \
      par_1 = params_1[idx];                                \
    }                                                       \
  }
```
Is this even possible without a significant change to the code structure?
If I do change par_2 to hold 4 float values, are there any other functions that I need to re-implement other than the operator() function inside each device?
Alternatives and other information
I could also try to swap out some part of par_4 to hold my parameters.
For this, I only need to overwrite the values defined in PulsedRPUDeviceCudaBase<T>::populateFrom(), which is easy to do in the HOST_COPY_BODY part of BUILD_PULSED_DEVICE_CONSTRUCTORS_CUDA.
The w_max and w_min inside par_4 were easy to understand.
But I wonder what the function of scale_up and scale_down inside par_4 is.
Are they just transported from here, and do they only control the step size for the two update directions?
If so, I should be able to re-use those slots for my own parameters instead.
Thanks a lot for all the help in this!
The scale_down and scale_up parameters are the minimal update step sizes with device-to-device variations (see e.g. how they are used in case of the linear step device). If you do not use these in your update function, you could indeed overwrite them. In that case you need to provide an implementation of the Functor update_once. So you would keep w_max and w_min as they are and use them in the update_once, re-use the slots of scale_up and scale_down for two of your parameters, and use par_2 for the other two. This way you can re-use all the other update functionality (in particular you can use the Functor version, as done e.g. here). If you need more device-to-device parameters, indeed many functions need to be adapted and it is more complicated.
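For illustration, a minimal sketch of what such a functor could look like. The exact operator() signature and parameter order must be copied from the existing functors in pwu_kernel.h (e.g. the linear step one); the functor name, the slot assignments, and the toy device physics below are assumptions, not aihwkit code:

```cpp
#include <cstdint>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Illustrative sketch only: par_4.x / par_4.y keep their usual meaning
// (w_min / w_max), while the scale_down / scale_up slots (par_4.z / par_4.w)
// and par_2 carry the four device-specific parameters of this hypothetical device.
template <typename T> struct UpdateFunctorMyDevice {

  __device__ __forceinline__ void operator()(
      T &apparent_weight,
      uint32_t n,          // number of coincident pulses
      uint32_t negative,   // update direction
      const float4 par_4,  // x: w_min, y: w_max, z: Ndiscmax, w: Ndiscmin (re-used slots)
      const float2 par_2,  // x: ldet,  y: A
      T &persistent_weight,
      const T *global_pars,
      T noise_std_dw,
      curandState &local_state) {

    T w = persistent_weight;
    for (uint32_t i = 0; i < n; i++) {
      // hypothetical device physics using the four per-device parameters
      T dw = (negative > 0) ? -(T)par_2.y : (T)par_2.y;
      w += dw * (T)par_2.x;
      // keep the weight within the bounds stored in par_4.x / par_4.y
      w = (w > (T)par_4.y) ? (T)par_4.y : w;
      w = (w < (T)par_4.x) ? (T)par_4.x : w;
    }
    persistent_weight = w;
    apparent_weight = w;
  }
};
```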
Regarding the CPU version, however, make sure that you can get/set the additional parameters with the get_device_parameters function. Otherwise, checkpointing will not work properly.
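As a hedged sketch of what this could mean on the C++ side, assuming the device follows the same hidden-parameter hooks as the other pulsed devices (the class name is a placeholder, and the exact function names and signatures should be checked against rpu_pulsed_device.h):

```cpp
#include <string>
#include <vector>

// Illustrative only: register the four additional per-device parameters so they
// are included when the device/hidden parameters are read or written (and hence
// picked up by checkpointing). getDeviceParameter / setDeviceParameter would
// need to be extended accordingly to return / restore the actual matrices.
template <typename T>
void MyRPUDevice<T>::getDPNames(std::vector<std::string> &names) const {
  PulsedRPUDevice<T>::getDPNames(names);
  names.push_back(std::string("Ndiscmax"));
  names.push_back(std::string("Ndiscmin"));
  names.push_back(std::string("ldet"));
  names.push_back(std::string("A"));
}
```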
Dear @maljoras
Upon implementing the CUDA version, I discovered a new function that is not defined in the CPU code: the applyWeightUpdate function does not work with the persistent weight.
As you suggested before in https://github.com/IBM/aihwkit/issues/379#issuecomment-1134839373, I am using the w_persistent_ as my hidden weight that determines the real weight.
So, how does the applyWeightUpdate function work, and do I need to modify it according to my needs?
Thanks a lot!
Zhenming
I see that applyWeightUpdate adds dw_and_current_weight_out to the weights and then checks whether the weights are within bounds.
But what purpose does this serve? Isn't the update calculated by converting dw into pulse counts using the learning rate and dw_min, and then applying the update cycle several times?
Why is there a shortcut that does not consider the device dynamics?
Is this for the FloatingPointDevice? But then why is it in the rpucuda_pulsed_device files?
Thanks for your time.
Zhenming
Hi @ZhenmingYu:
The applyWeightUpdate function is essentially not used currently (and not exposed to pytorch). It is only an experimental mechanism we used from C++ to simulate (approximate) data-parallel training for the base device. However, since this cannot be naturally supported for most analog training devices, you can just ignore that function (or, if you are worried that someone might call it, override the function in your device and call RPU_NOT_IMPLEMENTED to always cause a runtime error when it is accidentally used). In fact, I should probably remove that function.
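In case it is useful, a minimal sketch of such an override; the class name is a placeholder and the signature is inferred from the discussion above, so it must be matched against the applyWeightUpdate declaration of the base class:

```cpp
// Illustrative only: guard against accidental use of the data-parallel shortcut.
// MyRPUDeviceCuda is hypothetical; copy the exact applyWeightUpdate signature
// from the base class declaration before using this.
template <typename T>
void MyRPUDeviceCuda<T>::applyWeightUpdate(T *weights, T *dw_and_current_weight_out) {
  RPU_NOT_IMPLEMENTED; // updates for this device are only defined via the pulsed update_once
}
```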