operator() is horrendously slow
I know this is not a bug, but it has to be reported. The MP2 assembly step becomes a lot slower if I use BTAS' native operator().
(I previously wrote 50-100, but it seems that was not quite right. It is slow, though.)
Do you mean 100x slower than nested loops with explicit offset computation, or than DGEMM? A code snippet would be useful.
See above. I will measure the timing more accurately.
The BTAS code I am talking about is the following, which does not seem to be written for efficiency. The timing will follow (it takes a while, as it needs to recompile the entire BAGEL).
/// accesses element using its index, given as a pack of integers
template<typename index0, typename... _args>
typename std::enable_if<std::is_integral<index0>::value, const_reference>::type
operator() (const index0& first, const _args&... rest) const
{
  typedef typename common_signed_type<index0, typename index_type::value_type>::type ctype;
  // materialize the index pack into an initializer_list ...
  auto indexv = {static_cast<ctype>(first), static_cast<ctype>(rest)...};
  // ... then construct an index_type of the right size (a heap allocation when
  // index_type is a dynamically sized container) and copy the pack into it
  index_type index = array_adaptor<index_type>::construct(indexv.size());
  std::copy(std::begin(indexv), std::end(indexv), std::begin(index));
  return storage_[ range_.ordinal(index) ];
}
Code in BAGEL - I am talking about the nested loop in the middle (again, the timing follows; a sketch of the same loop with explicit offset computation is given after the snippet).
for (int n = 0; n != nloop; ++n) {
  // take care of data. The communication should be hidden
  if (n+ncache < nloop)
    cache.block(n+ncache, n-1);

  const int i = get<0>(cache.task(n));
  const int j = get<1>(cache.task(n));
  if (i < 0 || j < 0) continue;

  cache.data_wait(n);

  shared_ptr<const Matrix> iblock = cache(i);
  shared_ptr<const Matrix> jblock = cache(j);
  const Matrix mat(*iblock % *jblock); // V

  Matrix mat2 = mat; // 2T-T^t
  if (i != j) {
    mat2 *= 2.0;
    mat2 -= *mat.transpose();
  }
  Matrix mat3 = mat; // T

  for (int a = 0; a != nvirt; ++a) {
    for (int b = 0; b != nvirt; ++b) {
      const double denom = -eig(a+nocc)+eig(i)-eig(b+nocc)+eig(j);
      mat2(b,a) /= denom;
      mat3(b,a) /= denom;
    }
  }

  const double fac = i == j ? 1.0 : 2.0;
  ecorr += mat.dot_product(mat2) * fac;

  dmp2->add_block(2.0, nocca, nocca, nvirt, nvirt, mat2 % mat3);
  if (i != j)
    dmp2->add_block(2.0, nocca, nocca, nvirt, nvirt, mat2 ^ mat3);
}
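For comparison, here is a minimal sketch of that inner denominator loop with explicit offset computation instead of per-element operator() calls. It assumes the Matrix elements sit in one contiguous column-major buffer reachable through a data() accessor; both points are assumptions for the sketch, not confirmed BAGEL API.

// Sketch only: assumes Matrix stores its elements contiguously in column-major
// order and exposes them via a data() pointer (hypothetical accessor).
double* d2 = mat2.data();
double* d3 = mat3.data();
for (int a = 0; a != nvirt; ++a) {
  const double ea = -eig(a+nocc) + eig(i);   // hoist the a-dependent part
  for (int b = 0; b != nvirt; ++b) {
    const double denom = ea - eig(b+nocc) + eig(j);
    d2[a*nvirt + b] /= denom;                // same element as mat2(b,a)
    d3[a*nvirt + b] /= denom;                // same element as mat3(b,a)
  }
}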
It is most probably due to dynamic memory allocation in index construction. When you know the number of dimensions, parametrize the tensor by std::array. I do this when using btas with libint.
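A minimal sketch of that workaround, assuming the RangeNd/BoxOrdinal templates take the index container type as shown in the TensorNd alias proposed later in the thread, and that Tensor can be constructed directly from its extents:

#include <array>
#include <btas/btas.h>

int main() {
  constexpr long n = 16;
  // Fixed-rank range: the index type is std::array<long,2>, so operator()
  // can build its index on the stack instead of allocating.
  using Range2 = btas::RangeNd<CblasRowMajor, std::array<long, 2>,
                               btas::BoxOrdinal<CblasRowMajor, std::array<long, 2>>>;
  btas::Tensor<double, Range2> t(n, n);   // extents at run time, rank at compile time
  t(0, 1) = 1.0;                          // no per-call heap allocation for the index
  return 0;
}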
Sorry, maybe this is partly my fault. I will figure it out (also, I should default to std::array now).
We definitely need an alias for a constexpr number of dimensions ... perhaps something like this:
/// Tensor with const number of dimensions
template <typename _T,
          size_t _N,
          CBLAS_ORDER _Order = CblasRowMajor,
          class _Storage = btas::DEFAULT::storage<_T>,
          class = typename std::enable_if<std::is_same<_T, typename _Storage::value_type>::value>::type
         >
using TensorNd = Tensor<_T,
                        RangeNd<_Order, std::array<long, _N>, btas::BoxOrdinal<_Order, std::array<long, _N>>>,
                        _Storage
                       >;
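With such an alias, user code could fix the rank at compile time; a hypothetical usage sketch (the alias does not exist yet, and the names and extents are illustrative only):

// Hypothetical use of the proposed TensorNd alias: rank 2 is fixed at compile
// time, so operator() builds its index as a std::array with no heap allocation.
btas::TensorNd<double, 2> amp(nvirt, nvirt);
for (long a = 0; a != nvirt; ++a)
  for (long b = 0; b != nvirt; ++b)
    amp(b, a) = 0.0;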
View::begin() seems very slow, which was perhaps responsible for my problem. Just for your info.