Abdelrauf
@agibsonccc it seems that this case is handled intentionally. first of all, while making one of the divide arguments a fixed point, I found that it should be a...
the initial improvements were already merged:
- stride fix for all
- nearest modes + coordTranformation for Nearest
- bicubic coeff + coordTranformation + exclude_outside, which is using the v1 method,...
libnd4j always tries to delegate some ops to well-optimized platform kernels. that matrix multiplication should be delegated to third-party/platform BLAS kernels (in this case, mostly OpenBLAS). also, check performance with...
just checked the generic case for elementwise multiplication (the example that @pza94 provided), as it is calculated by libnd4j itself. I got vectorized code (~~avx~~ sse) there as well. ~~Therefore no...
correction: the generic one is using sse `mulps`. the avx one is using `vmulps`, but GCC adds extra instructions there (probably tuning for some faulty chip arch), which might...
well, adding `-mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store` with 32-byte alignment did not give any noticeable changes on an AMD CPU (Threadripper 3970X). as the op is memory bound, this is understandable. ~~but what surprised me...
@daviddbal were you using the same version of dl4j on intel as well?
@daviddbal sorry for asking again: so you were using 1.0.0-beta7 on both and there was performance degradation, right? and were you using the openblas or mkl one? intel sometimes does crazy...
I just tested beta7 on centos 6.8; it loaded the model:
```java
MultiLayerNetwork net2;
try {
    net2 = MultiLayerNetwork.load(new File("lstm_model_rnd_09042020.ipdl4j"), true);
    var x = net2.getLayers();
    for (var h : x) {
        System.out.println("-----" + h);
...
```