[WIP] performance refactor: batched kernel implementations
The main point of this PR is to change the implementation of the core finite element kernels from (pseudocode, using heat equation as an example):
for (auto element : mesh) {
// load the nodal temperatures for this element
auto element_temperatures = temperatures[element];
// integration loop
for (auto quadrature_point : integration_rule) {
// load the isoparametric coordinates, physical coordinates, and jacobian
// associated with this quadrature point
auto [xi, x, J] = integration_point_data[quadrature_point];
// compute the values of temperature, temperature gradient at this quadrature point
auto [u, du_dx] = trial_space::interpolate(element_temperatures, xi, J);
// evaluate the user's constitutive model at this quadrature point
auto [source, flux] = qf(x, u, du_dx);
// integrate the source and flux terms against shape functions and their derivatives (respectively)
// to get the element residual contribution from this quadrature point
element_residual += test_space::integrate(source, flux, xi, J);
}
}
to
for (auto element : mesh) {
// load the nodal temperatures for this element
auto element_temperatures = temperatures[element];
// load all the physical coordinates and jacobians
// for the quadrature points associated with this element
auto [x, J] = integration_point_data[element];
// compute the values of temperature and temperature gradient at each quadrature point
auto [u, du_dx] = trial_space::batch_interpolate(element_temperatures, xi, J);
// evaluate the user's constitutive model at this quadrature point
auto [sources, fluxes] = batch_apply(qf, x, u, du_dx);
// integrate the source and flux terms against shape functions and their derivatives (respectively)
// to get the complete element residual from all quadrature points
element_residual += test_space::batch_integrate(sources, fluxes, xi, J);
}
that is, to implement functions that perform batched interpolation and integration (acting on all of the quadrature points at once, rather than one at a time).
The main reason for this change is that computing these quantities in a batched way admits a more efficient algorithm that exploits the tensor product structure of the shape functions and quadrature rules to reduce the total amount of computation performed. To be clear: this more efficient algorithm is not an innovation by serac -- our initial kernel implementation was naive, so its performance was
To make this change possible, this PR requires that specializations of the finite_element class template implement two functions:
-
finite_element< geometry, function_space >::interpolate(element_values, jacobians, quadrature_rule)that takes in the values associated with an element, and returns an array of the interpolated values and derivatives for each quadrature point. ("E->Q") -
finite_element< geometry, function_space >::integrate(sources_and_fluxes, jacobians, quadrature_rule, element_residual)that takes in an array of the source and flux terms at each quadrature point, and numerically integrates them against the shape functions and shape function derivatives (respectively), adding the computed residuals to the last argument
Codecov Report
Merging #706 (b6db428) into develop (d5dcd67) will increase coverage by
0.08%. The diff coverage is98.52%.
@@ Coverage Diff @@
## develop #706 +/- ##
===========================================
+ Coverage 95.04% 95.12% +0.08%
===========================================
Files 142 146 +4
Lines 9440 9933 +493
===========================================
+ Hits 8972 9449 +477
- Misses 468 484 +16
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/serac/infrastructure/profiling.cpp | 50.00% <ø> (ø) |
|
| ...rac/numerics/functional/detail/metaprogramming.hpp | 100.00% <ø> (ø) |
|
| ...c/serac/numerics/functional/integral_utilities.hpp | 0.00% <ø> (-100.00%) |
:arrow_down: |
| src/serac/numerics/functional/polynomials.hpp | 100.00% <ø> (ø) |
|
| ...rics/functional/tests/functional_boundary_test.cpp | 80.00% <ø> (-4.54%) |
:arrow_down: |
| ...erics/functional/tests/functional_multiphysics.cpp | 100.00% <ø> (ø) |
|
| ...numerics/functional/tests/functional_nonlinear.cpp | 100.00% <ø> (ø) |
|
| ...ac/numerics/functional/detail/quadrilateral_L2.inl | 71.92% <71.92%> (-28.08%) |
:arrow_down: |
| ...rc/serac/numerics/functional/boundary_integral.hpp | 100.00% <100.00%> (ø) |
|
| .../numerics/functional/boundary_integral_kernels.hpp | 100.00% <100.00%> (ø) |
|
| ... and 41 more |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
https://github.com/LLNL/serac/blob/e9bcc81719b322201e85f451b87981c6de112687/src/serac/numerics/functional/boundary_integral.hpp#L31-L36
It seems that the doxygen block doesn't align with the class template parameters. I realize this is not code you modified in this PR, but I just caught it now.
https://github.com/LLNL/serac/blob/e9bcc81719b322201e85f451b87981c6de112687/src/serac/numerics/functional/boundary_integral.hpp#L31-L36
It seems that the doxygen block doesn't align with the class template parameters. I realize this is not code you modified in this PR, but I just caught it now.
Interesting, usually doxygen warns if the descriptions don't agree with the argument names. I wonder if it doesn't check consistency of template parameters?