
Be able to rerun computation with different input data

Infinoid opened this issue 5 years ago · 6 comments

A user on gitter asked whether it is possible to reset a Tensor, give it different data, and rerun the computation. Currently this does not work: the tensor's needsCompute flag is set to false, and there does not seem to be a way in the public API to reset it.

This program outputs the same data twice, and it should not:

#include "taco.h"

using namespace taco;

int main(int argc, char* argv[]) {
    IndexVar i{"i"};
    Tensor<double> X("X", {3}, Format({Dense}));
    Tensor<double> Y("Y", {3}, Format({Dense}));
    X(0) = 1.0;
    X(1) = 2.0;
    X(2) = 3.0;
    X.pack();

    Y(i) = X(i);
  
    Y.compile();
    Y.assemble();
    Y.compute();
    write("Y.tns", Y);

    X(0) = 3.0;
    X(1) = 2.0;
    X(2) = 1.0;
    X.pack();

    Y.assemble();
    Y.compute();
    write("Y2.tns", Y);
}

The Y2.tns file should have the values 3, 2, 1.

The goal is to avoid redoing the compile step. I think it should be possible to reuse a generated kernel as long as the algorithm, schedule, and input/output data formats are unchanged.
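For illustration, the kind of flow I have in mind might look something like the sketch below. Note that markNeedsCompute is a purely hypothetical name; no such call exists in taco's public API today, which is exactly the gap being reported:

    X(0) = 3.0;             // give X new values (works today)
    X.pack();               // repack X (works today)
    Y.markNeedsCompute();   // hypothetical: clear the needsCompute=false state
    Y.compute();            // rerun the already-compiled kernel on the new data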

Infinoid avatar Oct 28 '20 23:10 Infinoid

If the goal is just to avoid having to recompile the code needed to compute a specific kernel, I believe the C++ library currently has a code-caching mechanism that should automatically reuse any kernel that has previously been compiled. So if you have something like

IndexVar i("i"), j("j");
Format fmt({Dense, Dense});
// all four tensors have the same dimensions and format
Tensor<double> A("A", {3,3}, fmt), B("B", {3,3}, fmt);
Tensor<double> C("C", {3,3}, fmt), D("D", {3,3}, fmt);
A(i,j) = B(i,j);
C(i,j) = D(i,j);
A.evaluate();
C.evaluate();

then A and C should both be run using the same compiled code (i.e., the library will not generate code for matrix assignment twice). So if a user simply wants to perform a single computation with different sets of data, then defining a new tensor would suffice.
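Applied to the original program, a sketch of that workaround might look like this (the names X2 and Y2 are just for illustration):

    // Fresh tensors with the same shape and format as X and Y, so the
    // previously compiled kernel should be reused from the code cache.
    Tensor<double> X2("X2", {3}, Format({Dense}));
    X2(0) = 3.0;
    X2(1) = 2.0;
    X2(2) = 1.0;
    X2.pack();

    Tensor<double> Y2("Y2", {3}, Format({Dense}));
    Y2(i) = X2(i);
    Y2.evaluate();      // should hit the code cache instead of recompiling
    write("Y2.tns", Y2);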

That said, I think it's still a bug (with the lazy evaluation mechanism) that the example program above does not produce 3, 2, 1 in Y2.tns.

stephenchouca avatar Oct 29 '20 01:10 stephenchouca

Does it do this across program runs, or just within a run of the program?

hameerabbasi avatar Oct 29 '20 10:10 hameerabbasi

I believe the C++ library does currently have a code caching mechanism that should automatically reuse any kernel that's been previously compiled.

Neat. I wrote an example program to try to make use of that. Code and output are here: https://gist.github.com/Infinoid/d0867b87216ef5daffd0df6844e4189e

I do see that the second invocation doesn't spend any time compiling. I also see that it's pretty sensitive to the layout and sizes of inputs. Even if the inputs are dense, switching from 3x3 matrix inputs to 4x4 causes a recompile.
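A minimal sketch of the pattern I tried (tensor names and dimensions here are illustrative, not the exact gist code):

    IndexVar i("i"), j("j"), k("k");
    Format csr({Dense, Sparse}), dns({Dense, Dense});

    Tensor<double> A3("A3", {3,3}, dns), B3("B3", {3,3}, csr), C3("C3", {3,3}, dns);
    A3(i,j) = B3(i,k) * C3(k,j);
    A3.evaluate();   // compiles a kernel the first time

    Tensor<double> A4("A4", {4,4}, dns), B4("B4", {4,4}, csr), C4("C4", {4,4}, dns);
    A4(i,j) = B4(i,k) * C4(k,j);
    A4.evaluate();   // same expression and formats, but 4x4 triggers a recompile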

Also, it seems like the output of the second computation is incorrect.

The first computation multiplies a (CSR) identity matrix by a (Dense) matrix of 1.0 values. I got a dense matrix of 1.0 values back, as expected.

Then, I multiplied the same CSR identity matrix by a (Dense) matrix of 2.0 values. I got a dense matrix of 3.0 values back!

If I change the values so that the inputs are 3.0 and then 5.0, I get 3.0 and then 8.0 back. It seems as though the output of the first operation was added into the output of the second operation.

Does it do this across program runs, or just within a run of the program?

It seems to only cache within a run of the program. If I run the gist example again, it still spends time compiling the first one.

Infinoid avatar Oct 29 '20 14:10 Infinoid

@Infinoid asked me to put up an example of the timing difference between a version that relies on taco's caching to avoid compilation and one that uses a hack to sidestep compilation. Code is here: https://gist.github.com/goldnd/1512190adbc4f92b75a0443ff100a94f The timing may be completely wrong for the taco-cached example (timed using taco_timer1).

goldnd avatar Oct 29 '20 14:10 goldnd

Also, it seems like the output of the second computation is incorrect.

The first computation multiplies a (CSR) identity matrix by a (Dense) matrix of 1.0 values. I got a dense matrix of 1.0 values back, as expected.

Then, I multiplied the same CSR identity matrix by a (Dense) matrix of 2.0 values. I got a dense matrix of 3.0 values back!

If I change the values so that the inputs are 3.0 and then 5.0, I get 3.0 and then 8.0 back. It seems as though the output of the first operation was added into the output of the second operation.

That's because = for modifying individual tensor elements actually behaves more like +=, so when you do B(i,j) = 2 on every element, it actually increments each one by 2. It would be nice if TACO could support an = that overwrites existing elements, but the problem is that we currently don't have a way to generate the code that would be needed to support these semantics efficiently. (I believe @RawnH's work on generalizing TACO to support non-semirings might be able to address this, though.)
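In other words, a sketch of the behavior (values illustrative):

    Tensor<double> B("B", {2,2}, Format({Dense, Dense}));
    B(0,0) = 1.0;
    B.pack();       // B(0,0) is now 1.0
    B(0,0) = 2.0;   // intended as an overwrite...
    B.pack();       // ...but acts like +=: B(0,0) is now 3.0, not 2.0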

stephenchouca avatar Oct 29 '20 14:10 stephenchouca

Oh ok, thanks. If I hadn't reused the same input buffers, it would have worked fine.

Infinoid avatar Oct 29 '20 16:10 Infinoid