operator() is horrendously slow
I know this is not a bug, but it has to be reported. The MP2 assembly step becomes a lot slower if I use BTAS' native operator().
(I previously wrote 50-100, but it seems that was not quite right. It is slow, though.)
Do you mean 100x slower than nested loops with explicit offset computation, or than DGEMM? A code snippet would be useful.
See above. I will measure the timing more accurately.
The BTAS code I am talking about is the following, which does not seem to be written for efficiency. The timing will follow (it takes a while, as it needs to recompile the entire BAGEL).
/// accesses element using its index, given as a pack of integers
template<typename index0, typename... _args>
typename std::enable_if<std::is_integral<index0>::value, const_reference>::type
operator() (const index0& first, const _args&... rest) const
{
  typedef typename common_signed_type<index0, typename index_type::value_type>::type ctype;
  // materialize the index pack into an initializer_list ...
  auto indexv = {static_cast<ctype>(first), static_cast<ctype>(rest)...};
  // ... then construct an index_type of the right size (a heap allocation when
  // index_type is a dynamically sized container) and copy the pack into it
  index_type index = array_adaptor<index_type>::construct(indexv.size());
  std::copy(std::begin(indexv), std::end(indexv), std::begin(index));
  return storage_[ range_.ordinal(index) ];
}
Code in BAGEL - I am talking about the nested loop in the middle (again, the timing follows; a sketch of the same loop with explicit offset computation is given after the snippet).
for (int n = 0; n != nloop; ++n) {
  // take care of data. The communication should be hidden
  if (n+ncache < nloop)
    cache.block(n+ncache, n-1);

  const int i = get<0>(cache.task(n));
  const int j = get<1>(cache.task(n));
  if (i < 0 || j < 0) continue;

  cache.data_wait(n);

  shared_ptr<const Matrix> iblock = cache(i);
  shared_ptr<const Matrix> jblock = cache(j);
  const Matrix mat(*iblock % *jblock); // V

  Matrix mat2 = mat; // 2T-T^t
  if (i != j) {
    mat2 *= 2.0;
    mat2 -= *mat.transpose();
  }
  Matrix mat3 = mat; // T

  for (int a = 0; a != nvirt; ++a) {
    for (int b = 0; b != nvirt; ++b) {
      const double denom = -eig(a+nocc)+eig(i)-eig(b+nocc)+eig(j);
      mat2(b,a) /= denom;
      mat3(b,a) /= denom;
    }
  }

  const double fac = i == j ? 1.0 : 2.0;
  ecorr += mat.dot_product(mat2) * fac;

  dmp2->add_block(2.0, nocca, nocca, nvirt, nvirt, mat2 % mat3);
  if (i != j)
    dmp2->add_block(2.0, nocca, nocca, nvirt, nvirt, mat2 ^ mat3);
}
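For comparison, here is a minimal sketch of that inner denominator loop with explicit offset computation instead of per-element operator() calls. It assumes the Matrix elements sit in one contiguous column-major buffer reachable through a data() accessor; both points are assumptions for the sketch, not confirmed BAGEL API.

// Sketch only: assumes Matrix stores its elements contiguously in column-major
// order and exposes them via a data() pointer (hypothetical accessor).
double* d2 = mat2.data();
double* d3 = mat3.data();
for (int a = 0; a != nvirt; ++a) {
  const double ea = -eig(a+nocc) + eig(i);   // hoist the a-dependent part
  for (int b = 0; b != nvirt; ++b) {
    const double denom = ea - eig(b+nocc) + eig(j);
    d2[a*nvirt + b] /= denom;                // same element as mat2(b,a)
    d3[a*nvirt + b] /= denom;                // same element as mat3(b,a)
  }
}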
It is most probably due to dynamic memory allocation in index construction. When you know the number of dimensions, parametrize the tensor by std::array. I do this when using btas with libint.
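A minimal sketch of that workaround, assuming the RangeNd/BoxOrdinal templates take the index container type as shown in the TensorNd alias proposed later in the thread, and that Tensor can be constructed directly from its extents:

#include <array>
#include <btas/btas.h>

int main() {
  constexpr long n = 16;
  // Fixed-rank range: the index type is std::array<long,2>, so operator()
  // can build its index on the stack instead of allocating.
  using Range2 = btas::RangeNd<CblasRowMajor, std::array<long, 2>,
                               btas::BoxOrdinal<CblasRowMajor, std::array<long, 2>>>;
  btas::Tensor<double, Range2> t(n, n);   // extents at run time, rank at compile time
  t(0, 1) = 1.0;                          // no per-call heap allocation for the index
  return 0;
}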
Sorry, maybe this is partly my fault. I will figure it out (also, I should default to std::array now).
We definitely need an alias for a constexpr number of dimensions ... perhaps something like this:
/// Tensor with const number of dimensions
template <typename _T,
          size_t _N,
          CBLAS_ORDER _Order = CblasRowMajor,
          class _Storage = btas::DEFAULT::storage<_T>,
          class = typename std::enable_if<std::is_same<_T, typename _Storage::value_type>::value>::type
         >
using TensorNd = Tensor<_T,
                        RangeNd<_Order, std::array<long, _N>, btas::BoxOrdinal<_Order, std::array<long, _N>>>,
                        _Storage
                       >;
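With such an alias, user code could fix the rank at compile time; a hypothetical usage sketch (the alias does not exist yet, and the names and extents are illustrative only):

// Hypothetical use of the proposed TensorNd alias: rank 2 is fixed at compile
// time, so operator() builds its index as a std::array with no heap allocation.
btas::TensorNd<double, 2> amp(nvirt, nvirt);
for (long a = 0; a != nvirt; ++a)
  for (long b = 0; b != nvirt; ++b)
    amp(b, a) = 0.0;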
View::begin() seems very slow, which was perhaps responsible for my problem. Just for your info.