Reduction of execution time with sparse format
Hi,
I'm trying to compare an iterative linear regression approach implemented in TensorFlow (executed on the CPU) against taco (executed on the GPU). My goal is to show a tradeoff between TensorFlow (using a dense format) and taco (using a sparse format): at a certain data density the sparse format should outperform the dense format in execution time. I randomly create 12 different data densities, all with the same number of rows and columns (each dataset is approximately 1GB/5GB/10GB in size).
My prediction for taco is that as the density gets smaller (more sparse, more zero values), the execution time improves. Unfortunately this is not the case: taco takes the same amount of time for all densities.
I run the tests as follows:
I define the matrix and the vector format with exclusively sparse dimensions:
```c++
Format dMatrix({Sparse, Sparse});
Format dVector({Sparse});
```
After that I fill the tensors with my data, initialize the weight array, and pack everything:
```c++
Tensor<double> weightArray[101];
weightArray[0] = w;
X.pack(); XT.pack(); y.pack(); w.pack();
```
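Put together, the setup looks roughly like this (a minimal sketch: the dimensions and the data-loading step are placeholders here, not my actual values):

```c++
#include "taco.h"
using namespace taco;

int main() {
  // Placeholder dimensions; the real data is sized to roughly 1GB/5GB/10GB.
  const int rows = 1000;
  const int cols = 1000;

  Format dMatrix({Sparse, Sparse});
  Format dVector({Sparse});

  Tensor<double> X({rows, cols}, dMatrix);   // design matrix
  Tensor<double> XT({cols, rows}, dMatrix);  // transposed design matrix
  Tensor<double> y({rows}, dVector);         // targets
  Tensor<double> w({cols}, dVector);         // initial weights

  // ... insert the data values here, e.g. X.insert({i, j}, value); ...

  Tensor<double> weightArray[101];
  weightArray[0] = w;
  X.pack(); XT.pack(); y.pack(); w.pack();

  // ... iterative update loop goes here (shown below) ...
  return 0;
}
```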
At the end, the formula for the iterative update of the weights is executed:
```c++
// Part 1: loss vector
Loss(k) = (X(k, i) * weightArray[m](i) - y(k));
Loss.compile();
Loss.assemble();
// calc time
time(&start_t);
Loss.compute();
time(&end_t);
sum_time += difftime(end_t, start_t);

// Part 2: gradient
Grad(k) = (XT(k, i) * Loss(i)) * number_of_samples;
Grad.compile();
Grad.assemble();
// calc time
time(&start_t);
Grad.compute();
time(&end_t);
sum_time += difftime(end_t, start_t);

// Part 3: weight update
Weights(i) = w(i) - (Grad(i) * alpha);
Weights.compile();
Weights.assemble();
// calc time
time(&start_t);
Weights.compute();
time(&end_t);
sum_time += difftime(end_t, start_t);
// std::cout << "Weights " << Weights.getStorage().getValues() << std::endl;

weightArray[m+1] = Weights;
```
Some notes: I'm trying to measure only the compute time, excluding the compile and assemble time. For some reason taco does not seem to be able to compile the whole formula at once, so it is split into three sequential sub-expressions.
Any ideas why the execution time is essentially the same, e.g. around 3.1s for a density of 0.001 and around 3.2s for a density of 1.0 on the 1GB data? Nothing really changes when the data gets bigger.
Hi,
I wrote a program based on your description. One problem I noticed is that `time()` has a granularity of 1 second. I replaced it with a call to `gettimeofday()`, which has microsecond granularity, and I saw more meaningful results.
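The timing helper I mean looks roughly like this (a minimal sketch; the name `wallSeconds` is just a placeholder, not necessarily what appears in the program below):

```c++
#include <sys/time.h>

// Wall-clock time in seconds with microsecond resolution.
// time()/difftime() only tick once per second, which is too coarse
// for kernels that finish in well under a second.
static double wallSeconds() {
  struct timeval tv;
  gettimeofday(&tv, nullptr);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

// Used around each compute() call, e.g.:
//   double t0 = wallSeconds();
//   Loss.compute();
//   sum_time += wallSeconds() - t0;
```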
I do see some difference between the different sparse densities. I used matrices of size 1000x1000 and sparse vectors of size 1000. I wrote some stupid code to insert random values, enough to meet the requested density. The code is stupid because it doesn't pay attention to duplicate indices, so it will produce slightly fewer values than requested, but it's enough for a simple test.
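The random fill is along these lines (a sketch only; `makeRandomMatrix` and the constants are placeholder names, not taken verbatim from the program below):

```c++
#include <cstdlib>
#include "taco.h"
using namespace taco;

const int    ROWS    = 1000;
const int    COLS    = 1000;
const double DENSITY = 0.02;   // target fraction of nonzero entries

// Insert roughly DENSITY * ROWS * COLS random nonzeros. Duplicate
// coordinates are not filtered out, so the packed matrix ends up with
// slightly fewer nonzeros than requested.
Tensor<double> makeRandomMatrix(Format format) {
  Tensor<double> A({ROWS, COLS}, format);
  long target = (long)(DENSITY * ROWS * COLS);
  for (long n = 0; n < target; n++) {
    int i = rand() % ROWS;
    int j = rand() % COLS;
    A.insert({i, j}, ((double)rand()) / RAND_MAX);
  }
  A.pack();
  return A;
}
```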
Here are the timings I see, for various densities:
| density | average compute time (seconds) |
|---|---|
| 80% | 0.96 |
| 20% | 0.71 |
| 2% | 0.53 |
| 0.2% | 0.26 |
And the timings seem pretty consistent to me.
Here is the full program I used. The `DENSITY` definition controls how many nonzeroes it inserts. Could you compare this to your test program and verify the results?