
[Question]: should the output error in backward use the derivative of the activation function rather than the activation function itself?

Open ZJLi2013 opened this issue 1 year ago • 0 comments

Hi team, in fully_fused_mlp.cu the following code is confusing to me:

```cpp
// If the output width is larger than 16 dims, we use cutlass to backpropagate through the last layer
// rather than fusing it with our kernel.
if (m_output_width > 16) {
	fc_multiply<FullLayer>(stream, output_weight_matrix(use_inference_params).transposed(), tmp_dL_doutput, forward.hidden.at(tmp_idx), backward_tmp.at(backward_tmp_idx), m_activation, true);
}
```

I suppose it is computing `forward.hidden.at(output_layer) = output_matrix.T * tmp_dL_doutput`. For a 2-hidden-layer MLP, this corresponds to something like:

$$ \delta^{l_2} = \frac{\partial L}{\partial a^{l_2}} = \left( (W^{l_3})^T \delta^{l_3} \right) \odot \mathrm{ReLU}'(a^{l_2}) $$

So shouldn't the epilogue passed to `fc_multiply` be the derivative of ReLU (the activation function), rather than ReLU itself?
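For reference, the chain-rule step in the equation above can be sketched in plain C++. This is an illustrative reference implementation only (the names `relu_grad` and `backprop_layer` are mine, not tiny-cuda-nn's API); it shows the backward matrix-vector product followed by the elementwise multiply with the activation *derivative*:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// ReLU'(x): 1 where the pre-activation x is positive, 0 otherwise.
float relu_grad(float x) { return x > 0.0f ? 1.0f : 0.0f; }

// Computes delta_prev = (W^T * delta) ⊙ ReLU'(a_prev), i.e.
// delta_prev[i] = (sum_j W[j][i] * delta[j]) * ReLU'(a_prev[i]).
// W is laid out as W[out][in]; delta is the error of the next layer;
// a_prev holds the previous layer's pre-activations.
std::vector<float> backprop_layer(const std::vector<std::vector<float>>& W,
                                  const std::vector<float>& delta,
                                  const std::vector<float>& a_prev) {
	const size_t n_in = a_prev.size();
	const size_t n_out = delta.size();
	std::vector<float> delta_prev(n_in, 0.0f);
	for (size_t i = 0; i < n_in; ++i) {
		float s = 0.0f;
		for (size_t j = 0; j < n_out; ++j) {
			s += W[j][i] * delta[j]; // transposed weight product
		}
		// The epilogue multiplies by the *derivative* of the activation,
		// not by the activation itself.
		delta_prev[i] = s * relu_grad(a_prev[i]);
	}
	return delta_prev;
}
```

For example, with `W = {{1,2},{3,4}}`, `delta = {1,1}`, and `a_prev = {0.5,-0.5}`, the second component of the result is zeroed out because its pre-activation is negative, which is exactly the effect of the `ReLU'` factor.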

Thanks for the guidance, ZJ

ZJLi2013 · Aug 21 '24 02:08