
Bitwise Operation Support

Open jwfromm opened this issue 7 years ago • 7 comments

Tensor Comprehensions seems like a great way to implement binary approximations of layers as in BinaryNet. However, tc does not currently support bitwise operations such as xor (^), or (|), and and (&). Additionally, implementing these approximations would require the popc CUDA intrinsic, which is not supported. Getting these functions integrated would enable really cool and fast applications that could highlight the strengths of tc.
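For context, a minimal Python sketch (names are illustrative, not part of TC) of why popcount-of-xor can stand in for multiply-accumulate on binarized values: encoding +1 as bit 1 and -1 as bit 0, each product is +1 exactly when the bits agree, so the dot product of n elements equals n - 2*popcount(a ^ b).

```python
# Illustrative sketch: popcount(xor) as a multiply-accumulate
# for +/-1 vectors, as used in BinaryNet-style layers.

def dot_float(a, b):
    """Reference dot product of +/-1 vectors."""
    return sum(x * y for x, y in zip(a, b))

def pack(v):
    """Pack a +/-1 vector into an integer: bit i = (v[i] > 0)."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def dot_popcount(a_bits, b_bits, n):
    """Same dot product on bit-packed sign vectors:
    dot = (#agreements) - (#disagreements) = n - 2*popcount(a ^ b)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [-1, -1, 1, 1, -1, -1, 1, -1]
assert dot_float(a, b) == dot_popcount(pack(a), pack(b), len(a))
```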

jwfromm avatar Mar 08 '18 22:03 jwfromm

Thank you for your feedback. Indeed, we have been mostly working on floating point tensors and some support for integers may be missing (tag #98).

Would you mind giving us a realistic example of how you would write some TC statements with bitwise operations so that we can keep track of what is already implemented?

Also note that CUDA is not the only architecture that we intend to support, so intrinsic support should live at the TC level, allowing code to be emitted for any backend. So intrinsics support probably deserves a separate issue.

ftynse avatar Mar 08 '18 22:03 ftynse

I'd like to write something like this

CONV_LANG = """
def convolution(float(N,C,H,W) I, float(M,C,KH,KW) W1) -> (Xout) {{
   # binarize weight tensor
   bin_W(m, c, kh, kw) = W1(m, c, kh, kw) > 0
   # binarize input tensor
   bin_I(n, c, h, w) = I(n, c, h, w) > 0
   # compress binarized weights to 64 bit integers (b here would be 64)
   bin_W_compressed(m, c, kh, kw) +=! bin_W(m, c+b, kh, kw) << b
   # compress binarized inputs, requires shifting of bits and accumulating
   bin_I_compressed(n, c, h, w) +=! bin_I(n, c+b, h, w) << b
   # perform convolution using popcount-xnor instead of multiply
   Xout(n, m, h, w) +=! popc(bin_I_compressed(n, c, {sh} * h + kh, {sw} * w + kw) ^ bin_W_compressed(m, c, kh, kw))
}}
"""

Above, the first two steps (binarizing weights and inputs) already work fine thanks to inequality being supported. However, all the other operations require bitwise operators that are not supported. Bitwise operations and the popcount operator are supported on most architectures, not just CUDA. I've written a similar pipeline in pure Halide and had no trouble compiling it to its many backends.
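As a hedged sketch of what the TC above asks for, here is the same pipeline step-by-step in plain Python, using ints as 64-bit words: binarize, pack each run of 64 channels into one word, then accumulate popcount(xor). All names and shapes here are illustrative.

```python
# Sketch of the compress + popcount-xor steps, checked against a
# float reference on sign values. Not TC code; just the arithmetic.
import random

B = 64  # channels packed per word, matching the "b would be 64" comment

def binarize_pack(vals):
    """Pack a per-channel float list into words of B sign bits."""
    words = []
    for w0 in range(0, len(vals), B):
        word = 0
        for b, x in enumerate(vals[w0:w0 + B]):
            if x > 0:
                word |= 1 << b
        words.append(word)
    return words

def binary_dot(wa, wb, n):
    """n - 2*popcount(xor) over packed words == dot of the +/-1 vectors."""
    acc = n
    for a, b in zip(wa, wb):
        acc -= 2 * bin(a ^ b).count("1")
    return acc

# Check against a float reference on one "pixel" with 128 channels.
random.seed(0)
inp = [random.uniform(-1, 1) for _ in range(128)]
wgt = [random.uniform(-1, 1) for _ in range(128)]
sign = lambda x: 1 if x > 0 else -1
ref = sum(sign(i) * sign(w) for i, w in zip(inp, wgt))
assert ref == binary_dot(binarize_pack(inp), binarize_pack(wgt), 128)
```

The full convolution would run this dot product once per (n, m, h, w) output element, reducing over c, kh, kw.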

jwfromm avatar Mar 08 '18 23:03 jwfromm

For reference, here is the equivalent (fully tested and functional) Halide pipeline.

Var x("x"), y("y"), c("c"), k("k");
Func clamped;
clamped(x, y, c) = BoundaryConditions::constant_exterior(input, 0)(x, y, c);
Func binclamped;
RDom b(0, 64);
binclamped(x, y, c) = sum(select(clamped(x, y, 64*c + b) > 0, cast<int64_t>(1) << b, cast<int64_t>(0)), "binarize_input");
RDom r(0, size, 0, size, 0, binchannels);
output(x, y, k) = -cast<float>(2 * sum(popcount(weights(r.x, r.y, r.z, k) ^ binclamped(x * stride + r.x - pad, y * stride + r.y - pad, r.z))) - bin_adjust);

jwfromm avatar Mar 08 '18 23:03 jwfromm

@skimo-openhub, is it possible to propagate the bitwise operations to isl at the moment? If not, what would it involve? Can you provide some guidance on it?

prigoyal avatar Apr 25 '18 17:04 prigoyal

I'd say isl should handle this transparently as long as bitwise operations don't appear in index expressions. In this example they do not, so the first task would be adding those operations to lexer/parser/sema and tc2halide conversion.

ftynse avatar Apr 25 '18 18:04 ftynse

Cool, I'll do that first @ftynse, thanks much for the guidance.

prigoyal avatar Apr 25 '18 20:04 prigoyal

If you can also handle the % operator (#290) on the way, it would be amazing!

ftynse avatar Apr 25 '18 21:04 ftynse