How to implement custom finite differencing operators
I am interested in using this package to multithread some hyperbolic PDEs, starting from the example of the 1D acoustic wave equation. The first step is to figure out how to implement my own finite difference operators (I use summation-by-parts operators that are not diagonal, for example).
The documentation states that "Custom macros to extend the finite differences submodules or for other stencil-based numerical methods can be readily plugged in"
It is not obvious to me how to do this. Say I have a non-diagonal but sparse matrix D that I want to apply to the current state vector, how can I write a macro to do this?
@TakeTwiceDailey : thank you for your interest and sorry for the delay!
You need to express your computations with stencils, meaning you need to define computations on a cell-level. Have a look at the FiniteDifferences1D sub module: https://github.com/omlins/ParallelStencil.jl/blob/main/src/FiniteDifferences.jl#L1-L69
You can create macros like these or you can create your whole custom submodule, which you load instead or in addition to the FiniteDifferences submodule.
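For concreteness, here is a minimal sketch (plain Julia, no ParallelStencil required) of a custom stencil macro in the FiniteDifferences1D style. `ix` here is a stand-in for `ParallelStencil.INDICES[1]`, and `@d_xa`/`forward_diff` are illustrative names, not part of the package:

```julia
# Sketch: a forward-difference macro in the FiniteDifferences1D style.
# `ix` stands in for ParallelStencil.INDICES[1]; in a real submodule you
# would `import ParallelStencil.INDICES` instead.
const ix = :ix
macro d_xa(A::Symbol) esc(:( $A[$ix+1] - $A[$ix] )) end

# Usage: the escaped `ix` resolves to whatever index variable is in scope
# at the call site (in ParallelStencil, the thread index).
forward_diff(A) = [@d_xa(A) for ix in 1:length(A)-1]
```

The point is that the macro emits a cell-level expression in terms of the index symbols; the parallelization layer then evaluates it once per thread index.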
BTW: Please ask questions in the julia discourse; issues are for problems in the package or feature requests, see: https://github.com/omlins/ParallelStencil.jl#questions-comments-and-discussions
@omlins : Thanks for the reply. I have joined the Julia Slack, so I will be sure to use it in the future.
I just don't understand how to do a cell-level program without being inefficient. The finite differencing operators I use only touch points within the domain; they don't bleed into "ghost points" as the wave miniapp seems to, so the macro needs to be aware of where the bounds are.
Take my first attempt as an example:
macro d_x(A::Symbol)
    esc(:(
        if $ix == 1
            (-$A[$ix,$iy] + $A[$ix+1,$iy])
        elseif $ix == size($A,1)
            ( $A[$ix,$iy] - $A[$ix-1,$iy])
        else
            ($A[$ix+1,$iy] - $A[$ix-1,$iy])/2
        end
    ))
end
This works, but I fear the constant index-value checking will be inefficient. Are there any best practices for writing operators like this so that they work well with ParallelStencil.jl?
@TakeTwiceDailey : You don't need to include the boundary handling in these macros. Just write them for the inner points. Then for the boundaries you can write separate kernels as you find in the following example:
https://github.com/omlins/ParallelStencil.jl/blob/main/miniapps/Stokes2D.jl#L115-L116
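For reference, the pattern is: one branch-free kernel for the interior points plus a cheap second kernel for the edges. A plain-Julia sketch with ordinary loops standing in for @parallel kernels (`d_x_interior!` and `d_x_boundary!` are illustrative names):

```julia
# Interior: branch-free centered difference, applied only to rows 2:end-1.
function d_x_interior!(D, A)
    for j in axes(A, 2), i in 2:size(A, 1)-1
        @inbounds D[i, j] = (A[i+1, j] - A[i-1, j]) / 2
    end
    return D
end

# Boundaries: one-sided differences, written as a separate (tiny) kernel.
function d_x_boundary!(D, A)
    n = size(A, 1)
    for j in axes(A, 2)
        @inbounds D[1, j] = A[2, j] - A[1, j]
        @inbounds D[n, j] = A[n, j] - A[n-1, j]
    end
    return D
end
```

Keeping the branches out of the interior kernel means the bulk of the work runs without any per-cell bounds checks.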
I can't even figure out how to properly select the interior points.
Following this example:
import ParallelStencil.INDICES
const ix, iy = INDICES[1], INDICES[2]
ixi, iyi = :($ix+1), :($iy+1)
macro inn(A::Symbol) esc(:( $A[$ixi ,$iyi ] )) end
I have absolutely no idea how or why ixi, iyi = :($ix+1), :($iy+1) manages to select A[2:end-1,2:end-1], as I would have expected A[2:end, 2:end]. Doing instead:
import ParallelStencil.INDICES
const ix, iy = INDICES[1], INDICES[2]
ixi, iyi = :($ix+2), :($iy+2)
macro inn(A::Symbol) esc(:( $A[$ixi ,$iyi ] )) end
Acts as I expect, as it apparently selects A[3:end,3:end]. Ultimately I want to be able to select A[n:end-n,n:end-n] for any n (so I can write an arbitrarily accurate finite differencing stencil in the interior without selecting points outside the domain), but no amount of fiddling has gotten me close to this without writing explicit if statements, which I would think I should avoid.
In addition to defining the shift of the range start as you did, you also need to define the @within macro - also in analogy with the example: https://github.com/omlins/ParallelStencil.jl/blob/main/src/FiniteDifferences.jl#L61-L67
Or rather for 2D, it will be more similar to this: https://github.com/omlins/ParallelStencil.jl/blob/main/src/FiniteDifferences.jl#L176-L184
=> Replace -2 with -4 in your @within macro.
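To spell out the resolution of the off-by-one puzzle: the index expression only shifts the *start* of the accessed range; the *end* is trimmed separately by the @within bound. A plain-Julia sketch of how the interpolation composes (`ix` stands in for `INDICES[1]`):

```julia
ix   = :ix          # symbolic thread index (stands in for INDICES[1])
ixi  = :($ix + 1)   # interpolation builds the shifted index expression
expr = :(A[$ixi])   # what @inn splices into the kernel body
@assert expr == :(A[ix + 1])
# @inn shifts only the start: with ix = 1 it reads A[2]. The upper end is
# capped by @within (ix <= size(A,1) - 2), so ix runs over 1:size(A,1)-2
# and A[ix+1] covers exactly A[2:end-1] -- hence +1, not +2.
```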
Thanks for your help @omlins
Here is what I have settled on (for those who may want to do the same thing as me); please let me know if you think it can be done better somehow. I disobeyed your advice to code the boundary regions into separate kernels, as I really don't want to have to write separate equations of motion just for those regions.
T = Float64
const ix, iy = INDICES[1], INDICES[2]
const b = 7
ixi, iyi = :($ix+b), :($iy+b)
macro inn(A::Symbol) esc(:( $A[$ixi ,$iyi ] )) end
const a1 = T(-1/60.)
const a2 = T(9/60.)
const a3 = T(-45/60.)
const a4 = T(45/60.)
const a5 = T(-9/60.)
const a6 = T(1/60.)
macro D_x(A::Symbol, ql::Symbol, qr::Symbol)
    esc(:(
        n = size($A,1); i = $ix; j = $iy;
        if i in 7:n-6
            (a1*$A[i-3,j] + a2*$A[i-2,j] + a3*$A[i-1,j] +
             a4*$A[i+1,j] + a5*$A[i+2,j] + a6*$A[i+3,j])
        elseif i < b
            ($ql[i,1]*$A[ 1 ,j] + $ql[i,2]*$A[ 2 ,j] + $ql[i,3]*$A[ 3 ,j]
           + $ql[i,4]*$A[ 4 ,j] + $ql[i,5]*$A[ 5 ,j] + $ql[i,6]*$A[ 6 ,j]
           + $ql[i,7]*$A[ 7 ,j] + $ql[i,8]*$A[ 8 ,j] + $ql[i,9]*$A[ 9 ,j])
        else
            k = i-(n-b+1);
            ($qr[k,1]*$A[n-8,j] + $qr[k,2]*$A[n-7,j] + $qr[k,3]*$A[n-6,j]
           + $qr[k,4]*$A[n-5,j] + $qr[k,5]*$A[n-4,j] + $qr[k,6]*$A[n-3,j]
           + $qr[k,7]*$A[n-2,j] + $qr[k,8]*$A[n-1,j] + $qr[k,9]*$A[ n ,j])
        end
    ))
end
macro d_y(A::Symbol)
    esc(:(
        ( -0.5*$A[$ix,$iyi-1] + 0.5*$A[$ix,$iyi+1] )
    ))
end
macro within(macroname::String, A::Symbol)
    if     macroname == "@all" esc( :($ix<=size($A,1)     && $iy<=size($A,2)    ) )
    elseif macroname == "@inn" esc( :($ix<=size($A,1)-2*b && $iy<=size($A,2)-2*b) )
    elseif macroname == "@D_x" esc( :($ix<=size($A,1)     && $iy<=size($A,2)    ) )
    elseif macroname == "@d_y" esc( :($ix<=size($A,1)     && $iy<=size($A,2)-2*b) )
    else error("unknown macroname: $macroname. If you want to add your own assignment macros,
               overwrite the macro 'within(macroname::String, A::Symbol)';
               to still use the existing macro within as well, call
               ParallelStencil.FiniteDifferences{1|2|3}D.@within(macroname, A) at the end.")
    end
end
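As a quick sanity check on the interior coefficients (plain Julia; `c` and `interior_dx` are illustrative names): applied to A[i] = i, a consistent first-derivative stencil must return exactly 1, since the macro leaves division by the grid spacing to the caller.

```julia
# Sixth-order centered first-derivative coefficients from the macro above.
c = (-1, 9, -45, 45, -9, 1) ./ 60
interior_dx(A, i) = c[1]*A[i-3] + c[2]*A[i-2] + c[3]*A[i-1] +
                    c[4]*A[i+1] + c[5]*A[i+2] + c[6]*A[i+3]

A = collect(1.0:20.0)
@assert all(interior_dx(A, i) ≈ 1.0 for i in 4:17)
```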
I was worried earlier in this thread that index value checking would be slow, but as I have come to find, this simply doesn't matter. This package just seems to be completely memory bandwidth limited (as discussed in the docs). I benchmarked both @D_x and @d_y here, which should be very different computationally, but they take the same amount of time to evaluate regardless of resolution.
At least on the GPU. If I change to the CPU, I get some weird behaviour where it seems the qr matrix gets randomly garbled and it gives incorrect results for the @D_x operator. The ql region is fine though. Does this have something to do with shared memory between threads?
Just to share my two cents: I usually find that it's better (or in the worst case, equally performant) to use closures rather than macros inside @parallel kernels, and this results in better-looking code. You could instead do something like:
Base.@propagate_inbounds @inline function D_x_fun(i, j, A, ql, qr)
    n = size(A,1)
    if i in 7:n-6
        (a1*A[i-3,j] + a2*A[i-2,j] + a3*A[i-1,j] +
         a4*A[i+1,j] + a5*A[i+2,j] + a6*A[i+3,j])
    elseif i < b
        (ql[i,1]*A[ 1 ,j] + ql[i,2]*A[ 2 ,j] + ql[i,3]*A[ 3 ,j]
       + ql[i,4]*A[ 4 ,j] + ql[i,5]*A[ 5 ,j] + ql[i,6]*A[ 6 ,j]
       + ql[i,7]*A[ 7 ,j] + ql[i,8]*A[ 8 ,j] + ql[i,9]*A[ 9 ,j])
    else
        k = i-(n-b+1)
        (qr[k,1]*A[n-8,j] + qr[k,2]*A[n-7,j] + qr[k,3]*A[n-6,j]
       + qr[k,4]*A[n-5,j] + qr[k,5]*A[n-4,j] + qr[k,6]*A[n-3,j]
       + qr[k,7]*A[n-2,j] + qr[k,8]*A[n-1,j] + qr[k,9]*A[ n ,j])
    end
end
@parallel_indices (i,j) function foo(A, ql, qr)
    D_x = () -> D_x_fun(i, j, A, ql, qr)  # define closure capturing i and j
    @inbounds D_x()                       # plus whatever extra code here....
    return nothing
end
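To see the pattern in isolation, here is a runnable plain-Julia version (an ordinary loop instead of @parallel_indices; `d2` and `kernel!` are illustrative names). The zero-argument closure captures the loop indices, and the compiler should inline it, so it costs the same as writing the stencil out by hand:

```julia
Base.@propagate_inbounds @inline d2(i, j, A) = (A[i+1, j] - A[i-1, j]) / 2

function kernel!(B, A)
    for j in axes(A, 2), i in 2:size(A, 1)-1
        dx = () -> d2(i, j, A)    # closure capturing i, j and A
        @inbounds B[i, j] = dx()  # called like a local derivative operator
    end
    return B
end
```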
macro within(macroname::String, A::Symbol)
    if     macroname == "@all" esc( :($ix<=size($A,1)     && $iy<=size($A,2)    ) )
    elseif macroname == "@inn" esc( :($ix<=size($A,1)-2*b && $iy<=size($A,2)-2*b) )
    elseif macroname == "@D_x" esc( :($ix<=size($A,1)     && $iy<=size($A,2)    ) )
    elseif macroname == "@d_y" esc( :($ix<=size($A,1)     && $iy<=size($A,2)-2*b) )
    ...
@TakeTwiceDailey You only need to include here the macros that are used on the left side of the equal signs, which I think here would be just the first two.
> Does this have something to do with shared memory between threads?
It should not. Does the problem occur also if you use only one thread?