Abdelrauf

16 comments of Abdelrauf

@agibsonccc it seems that this case was done intentionally. First of all, when I make one of the divide arguments a fixed-point value, I get that it should be a...

The initial improvements were already merged:
- stride fix for all
- nearest modes + coordTransformation for Nearest (see the sketch below)
- bicubic coeff + coordTransformation + exclude_outside, which uses the v1 method, ...
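
Not the merged code itself, just a rough sketch of what a Nearest coordinate transformation does, assuming the ONNX Resize `half_pixel` mapping with `round_prefer_floor` rounding; the class and method names here are hypothetical, not libnd4j's:

```
// Hypothetical illustration of a nearest-neighbor coordinate transformation
// (ONNX Resize "half_pixel" + "round_prefer_floor"); not the libnd4j implementation.
public class NearestCoordDemo {
    // Map an output (resized) pixel index back to an input index.
    static int nearestSourceIndex(int xResized, double scale, int inSize) {
        double xOriginal = (xResized + 0.5) / scale - 0.5; // half_pixel mapping
        int idx = (int) Math.ceil(xOriginal - 0.5);        // round, ties go to floor
        return Math.min(Math.max(idx, 0), inSize - 1);     // clamp into [0, inSize)
    }

    public static void main(String[] args) {
        int inSize = 4, outSize = 8;
        double scale = (double) outSize / inSize;
        for (int x = 0; x < outSize; x++) {
            System.out.println(x + " -> " + nearestSourceIndex(x, scale, inSize));
        }
    }
}
```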

libnd4j always tries to delegate some ops to well-optimized platform kernels. That matrix multiplication should be delegated to third-party/platform BLAS kernels (in this case, mostly OpenBLAS). Also, check performance with...
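
For reference, a minimal ND4J sketch of the kind of matrix multiply that gets delegated (the class name is mine; whether OpenBLAS or MKL is picked depends on which nd4j-native backend is on the classpath):

```
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class MatmulBlasDemo {
    public static void main(String[] args) {
        // Plain GEMM through ND4J; libnd4j hands this off to the platform BLAS.
        INDArray a = Nd4j.rand(1024, 1024);
        INDArray b = Nd4j.rand(1024, 1024);

        long start = System.nanoTime();
        INDArray c = a.mmul(b);
        System.out.printf("mmul took %.1f ms, shape %s%n",
                (System.nanoTime() - start) / 1e6,
                java.util.Arrays.toString(c.shape()));
    }
}
```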

Just checked the generic case for elementwise multiplication (the example that @pza94 provided), as it is calculated by libnd4j itself. I got vectorized code (~~AVX~~ SSE) there as well. ~~Therefore no...
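
For context, a minimal elementwise-multiply call of the kind being discussed (not necessarily @pza94's exact example); unlike `mmul`, this path is computed by libnd4j's own loops, which is where the SSE/AVX code generation matters:

```
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class ElementwiseMulDemo {
    public static void main(String[] args) {
        // Elementwise product is not a BLAS call; libnd4j computes it itself.
        INDArray a = Nd4j.rand(1, 10_000_000);
        INDArray b = Nd4j.rand(1, 10_000_000);

        long start = System.nanoTime();
        INDArray c = a.mul(b); // out-of-place elementwise multiply
        System.out.printf("mul took %.1f ms, sum=%s%n",
                (System.nanoTime() - start) / 1e6, c.sumNumber());
    }
}
```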

Correction: the generic one is using SSE `mulps`. The AVX one is using `vmulps`, but GCC adds extra instructions there (probably because of some faulty chip arch), which might...

Well, adding `-mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store` with 32-byte alignment did not give any noticeable change on an AMD CPU (Threadripper 3970X). As the op is memory-bound, that is understandable. ~~But what surprised me...

@daviddbal were you using the same version of DL4J on Intel as well?

@daviddbal sorry for asking again: so you were using 1.0.0-beta7 on both and there was performance degradation, right? And were you using the OpenBLAS or the MKL one? Intel sometimes does crazy...

I just tested beta7 on CentOS 6.8; it loaded the model:

```
MultiLayerNetwork net2;
try {
    net2 = MultiLayerNetwork.load(new File("lstm_model_rnd_09042020.ipdl4j"), true);
    var x = net2.getLayers();
    for (var h : x) {
        System.out.println("-----" + h);
...
```