
Over 50% faster on average than the traditional memcpy in gcc 4.9 or vc2012

Results 8 FastMemcpy issues

https://github.com/skywind3000/FastMemcpy/blob/8fea5f666be174c6548d0ae4010e81b0a742c853/FastMemcpy.h#L644 Hi, it bugs me as 128 seems to be a reasonable choice. Is that derived from experiments? Or something related to the mechanism of prefetching itself?

This actually appears to be slower on GCC 5.4

> benchmark(size=32 bytes, times=16777216):
> result(dst aligned, src aligned): memcpy_fast=42ms memcpy=48 ms
> result(dst aligned, src unalign): memcpy_fast=46ms memcpy=54 ms
> ...

This may be a naive question, but why use `_mm_sfence` if size >= L2 cache size? https://github.com/skywind3000/FastMemcpy/blob/master/FastMemcpy.h#L680 And what if the assumed L2 cache size (0x200000) is not the actual L2 cache size, is there any...
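For context on the question above, here is a minimal sketch of the usual pattern behind a fence after a large copy. It assumes the large-size path uses non-temporal (streaming) stores, which bypass the cache and are weakly ordered, so a store fence is issued once the loop finishes to make them globally visible in order. The function name `copy_streaming` is hypothetical and is not part of FastMemcpy; on targets without SSE2 the sketch simply falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#define HAVE_SSE2_STREAM 1
#endif

/* Hypothetical helper (not FastMemcpy's API): copy n bytes using
 * non-temporal stores where SSE2 is available. */
static void *copy_streaming(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#ifdef HAVE_SSE2_STREAM
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    /* _mm_stream_si128 requires a 16-byte aligned destination,
     * so copy the unaligned head with plain memcpy first. */
    size_t head = (16u - ((uintptr_t)d & 15u)) & 15u;
    if (head > n) head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* Main loop: unaligned load, streaming (cache-bypassing) store. */
    for (; n >= 16; d += 16, s += 16, n -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }

    /* Streaming stores are weakly ordered: fence so they become
     * visible before any stores that follow this call. */
    _mm_sfence();

    /* Copy the remaining tail (fewer than 16 bytes). */
    memcpy(d, s, n);
    return dst;
#else
    /* Assumption: no SSE2 available, fall back to the libc copy. */
    return memcpy(dst, src, n);
#endif
}
```

Streaming stores tend to pay off only when the copied data is too large to stay in cache, which is presumably why a cache-size threshold selects this path. If the threshold does not match the real cache size, the copy is still correct and only the choice of path is suboptimal.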

`gcc version 10.2.1 20201007 releases/gcc-10.2.0-350-g136256c32d (Clear Linux OS for Intel Architecture)`

> ./FastMemcpy
> benchmark(size=32 bytes, times=16777216):
> result(dst aligned, src aligned): memcpy_fast=48ms memcpy=35 ms
> result(dst aligned, src...

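I compared it against the current MCFCRT. Because MCFCRT does not plan to support AVX, I only tested the SSE version (honestly, I was just too lazy to change it; it would be fairly simple, since the copy operations are currently packed as two consecutive `movups` instructions, and tweaking that part would be enough to support AVX): ![4311](https://user-images.githubusercontent.com/5071344/35473846-b0aeb952-03c0-11e8-8ffa-5ed06979666d.png) ```plaintext gcc (gcc-7-branch HEAD with MCF thread model, built by LH_Mouse.) 7.3.1 20180125 Copyright (C) 2017 Free...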

…w-w64 targets. On MinGW and mingw-w64 targets, `memcpy()` is imported from MSVCRT.DLL. For benchmarking purposes, we have to eliminate the overhead of the implicit import by specifying `dllexport` explicitly....

As more and more people use servers with the arm64 architecture, supporting the arm64 architecture with SIMD becomes meaningful.