ncnn Add op remainder for all platforms

[x] Remainder implementation for all platforms
- [x] arm
- [x] loongarch
- [x] mips, working
- [x] riscv
- [x] vulkan
- [x] x86
[x] PNNX conversion
[x] Unit tests

Aug 03 '23 15:08 FisherWY

Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

:white_check_mark: nihui
:x: FisherWY
You have signed the CLA already but the status is still pending? Let us recheck it.

Aug 03 '23 15:08 tencent-adm

remainder should be implemented inside binaryop...

Aug 03 '23 15:08 nihui

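Yes. Yesterday I worked from the Paddle documentation, and only after opening the PR did I notice that Paddle's and Torch's Remainder are not the same 😂. I will fix this in the next commit. Paddle docs: link. Torch docs: link.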

Aug 04 '23 02:08 FisherWY

Codecov Report

Merging #4912 (1fd5705) into master (c45c01c) will decrease coverage by 0.05%. Report is 32 commits behind head on master. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #4912      +/-   ##
==========================================
- Coverage   89.81%   89.76%   -0.05%     
==========================================
  Files         306      306              
  Lines       86875    86997     +122     
==========================================
+ Hits        78024    78091      +67     
- Misses       8851     8906      +55

| Files Changed | Coverage Δ | |
|---|---|---|
| src/layer/binaryop.cpp | `97.19% <0.00%> (-2.20%)` | :arrow_down: |
| src/layer/x86/avx512_mathfun.h | `99.00% <0.00%> (-1.00%)` | :arrow_down: |
| src/layer/x86/avx_mathfun.h | `98.79% <0.00%> (-1.21%)` | :arrow_down: |
| src/layer/x86/binaryop_x86.cpp | `98.11% <0.00%> (-1.69%)` | :arrow_down: |
| src/layer/x86/sse_mathfun.h | `98.78% <0.00%> (-1.22%)` | :arrow_down: |

... and 6 files with indirect coverage changes

Aug 05 '23 00:08 codecov-commenter

Many CI builds are failing to compile; these need to be fixed.

Sep 21 '23 02:09 nihui

I have implemented it on x86 following the formula given by Torch, but the results do not seem to match (test_binaryop fails): `torch.remainder(a, b) == a - a.div(b, rounding_mode="floor") * b`, link, 🤔

Sep 21 '23 14:09 FisherWY

        float div_result = x / y;
        float round_result = roundf(div_result);
        float res = x - y * round_result;
        return res;

The roundf( x / y ) here is not the same as div with floor rounding, is it?

Sep 26 '23 07:09 nihui
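
To make the difference concrete, here is a minimal sketch (not the PR's code) that contrasts the Torch identity quoted in this thread, which rounds the quotient with floor, against the roundf-based variant from the snippet. The function names `remainder_floor` and `remainder_round` are made up for this illustration.

```cpp
#include <cmath>

// Floor-based remainder, following the identity a - floor(a / b) * b.
// For finite inputs the result takes the sign of the divisor y.
static inline float remainder_floor(float x, float y)
{
    return x - y * floorf(x / y);
}

// Variant from the snippet under discussion: the quotient is rounded to
// the nearest integer (halfway cases away from zero) instead of floored.
static inline float remainder_round(float x, float y)
{
    return x - y * roundf(x / y);
}
```

For x = 5 and y = 3 the floor version yields 2, while the round version yields -1, because roundf(5 / 3) is 2 rather than 1. The two only agree when the fractional part of x / y is below one half.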

I ran into a strange problem. Steps to reproduce:

1. In `src/layer/binaryop.cpp`, write an implementation that always returns 0:

struct binary_op_remainder
{
    float operator()(const float& x, const float& y) const
    {
        return 0.0f;
    }
};

2. In `src/layer/x86/binaryop_x86.cpp`, implement the x86 version, which also returns 0:

struct binary_op_remainder
{
    float func(const float& x, const float& y) const
    {
        return 0.0f;
    }
#if __SSE2__
    __m128 func_pack4(const __m128& x, const __m128& y) const
    {
        __m128 res = _mm_setzero_ps();
        return res;
    }
#if __AVX__
    __m256 func_pack8(const __m256& x, const __m256& y) const
    {
        __m256 res = _mm256_setzero_ps();
        return res;
    }
#if __AVX512F__
    __m512 func_pack16(const __m512& x, const __m512& y) const
    {
        __m512 res = _mm512_setzero_ps();
        return res;
    }
#endif // __AVX512F__
#endif // __AVX__
#endif // __SSE2__
};

3. Build and run the unit tests; the results nevertheless differ:
   ![image](https://user-images.githubusercontent.com/32707008/275434958-8e9949fa-2e45-420b-949d-b218cbb2a881.png)

4. What causes this? (My understanding is that the unit test compares the result computed by `src/layer/binaryop.cpp` against the implementation for the corresponding platform. Is that understanding wrong?)

Oct 16 '23 08:10 FisherWY
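
The stubs in this comment return zero only to isolate the test failure. As a rough sketch of what a floor-based `func_pack4` body might compute once the scalar semantics are settled, the code below uses SSE2 intrinsics only. The helper names `floor_ps_sse2` and `remainder_floor_pack4` are hypothetical and not taken from ncnn, and the sketch assumes finite inputs whose quotient fits in a 32-bit integer.

```cpp
#include <emmintrin.h>

// Hypothetical SSE2-only floor: truncate toward zero, then subtract 1 in
// lanes where truncation moved the value up (negative non-integers).
// Assumption: |x| < 2^31, so the int32 conversion does not overflow.
static inline __m128 floor_ps_sse2(__m128 x)
{
    __m128 truncated = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));
    __m128 too_big = _mm_cmpgt_ps(truncated, x);
    return _mm_sub_ps(truncated, _mm_and_ps(too_big, _mm_set1_ps(1.0f)));
}

// Floor-based remainder on four lanes: x - y * floor(x / y).
// NaN, infinity, and y == 0 are not handled specially in this sketch.
static inline __m128 remainder_floor_pack4(__m128 x, __m128 y)
{
    __m128 quotient = floor_ps_sse2(_mm_div_ps(x, y));
    return _mm_sub_ps(x, _mm_mul_ps(y, quotient));
}
```

When SSE4.1 is available, `_mm_floor_ps` could replace the helper and lift the int32 range restriction.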

`test layer gpu failed` means the vulkan implementation is not aligned with binaryop.cpp

Oct 16 '23 08:10 nihui
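
Put differently, the reference result from `binaryop.cpp` is checked against every backend, vulkan included, so matching the x86 stub alone is not enough. The idea can be sketched as follows; `outputs_match` is a hypothetical helper written for this illustration and is not ncnn's actual test code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical check: true only if both outputs have the same length and
// every element differs by at most epsilon. A NaN difference counts as a
// mismatch because the comparison is written as !(diff <= epsilon).
static bool outputs_match(const std::vector<float>& reference,
                          const std::vector<float>& backend,
                          float epsilon)
{
    if (reference.size() != backend.size())
        return false;

    for (std::size_t i = 0; i < reference.size(); i++)
    {
        float diff = std::fabs(reference[i] - backend[i]);
        if (!(diff <= epsilon))
            return false;
    }
    return true;
}
```

With a reference of all zeros, an x86 output of all zeros passes, while a GPU output that still computes something else fails, which is consistent with the `test layer gpu failed` message.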

I see, thank you very much!

Oct 16 '23 10:10 FisherWY

Many CI tests are failing qaq

Oct 20 '23 02:10 nihui