FastTransforms.jl

fast hermite transform?

Open · AshtonSBradley opened this issue on Jun 09 '17 · 29 comments

I was talking to @dlfivefifty over at ApproxFun.jl, where they have an implementation of the Hermite transform using gausshermite quadrature. While this is clearly a reliable choice, is there any scope for implementing an O(N log N) (i.e. faster than O(N^2)) method as described in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2630232/#!po=88.5000 ? For band-limited functions there is a factorization using the recursion relation, akin to the FFT vs the basic FT. I don't know it well enough to know how practical it is to implement. On the surface it appears an O(N log N) method may be possible for any orthogonal polynomials with a two-term recursion...

EDIT: it seems that the exact two-term recursion factorisation is numerically unstable, and in practice the transform is implemented as an approximate Hermite-Newton-Cotes transform on a linear grid. Probably my question should be closed here.
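For context, the reliable O(N^2) route the thread starts from is the quadrature-based transform. A minimal sketch (in Python/NumPy rather than the thread's Julia, purely for illustration; `hermite_vandermonde` and `hermite_transform` are hypothetical helper names):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from math import factorial, sqrt, pi

def hermite_vandermonde(n, x):
    """H_0..H_{n-1} (physicists' Hermite) at points x, via the three-term recurrence."""
    V = np.zeros((len(x), n))
    V[:, 0] = 1.0
    if n > 1:
        V[:, 1] = 2 * x
    for k in range(1, n - 1):
        V[:, k + 1] = 2 * x * V[:, k] - 2 * k * V[:, k - 1]
    return V

def hermite_transform(f, n):
    """Coefficients of f in the Hermite basis, by n-point Gauss-Hermite quadrature:
    c_k = <f, H_k>_w / ||H_k||_w^2, with weight w(x) = exp(-x^2)."""
    x, w = hermgauss(n)               # exact for polynomials of degree <= 2n - 1
    V = hermite_vandermonde(n, x)
    nrm = np.array([2.0**k * factorial(k) * sqrt(pi) for k in range(n)])
    return (V.T * w) @ f(x) / nrm     # one dense matvec: O(n^2)

# Since H_2(x) = 4x^2 - 2, we have x^2 = (H_2 + 2 H_0)/4:
c = hermite_transform(lambda x: x**2, 8)  # c[0] ≈ 0.5, c[2] ≈ 0.25, rest ≈ 0
```

The question above is whether the dense matvec in `hermite_transform` can be factored into O(N log N) operations, as the FFT does for the DFT.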

AshtonSBradley avatar Jun 09 '17 21:06 AshtonSBradley

A fast Hermite transform is definitely in the scope of this package.

I just finished reading the paper and I have a couple comments:

  • the FHT appears to be unstable past degree 64,
  • at such low degrees, I'm not convinced that anything can be made faster than naive matrix-vector multiplication on current architectures,
  • the convergence of the Hermite-Newton-Cotes transform appears to be algebraic, governed by the order k of the interpolation used (N.B. it's well-known that Newton-Cotes of high degree is also unstable so it's not as simple as just cranking up the order),
  • there appears to be an O(M^2) pre-computation,

and questions:

  • do you have an immediate need for an FHT?
  • is the algorithm you pointed to sufficient for your needs?
  • have you considered mapped Chebyshev expansions, or must you use Hermite functions?

One option is to write a wrapper in FastTransforms.jl for the C code on Rockmore's website. But sometimes getting a Pkg.build to work across all platforms is more hassle than it's worth. Another option is to develop a better Hermite transform. @ajt60gaibb, is there any low-rank magic here?

MikaelSlevinsky avatar Jun 09 '17 23:06 MikaelSlevinsky

Hi Mikael, thanks for your comments.

I do need a fast FHT for degrees 30–300, as it is the main bottleneck of a time-evolution algorithm that my group uses quite a bit. In practice we do separate Hermite transforms over x, y, z using separability. However, I appreciate that there may not be anything faster than O(N^2) for these values of N. Precomputation is not an issue, as we do the same transform many times during time evolution. I have my own version of basically ApproxFun's hermitetransform.jl that gets past overflow in precomputation by using BigFloat to build the transform matrix. This allows me to push the degree up to about 350.

Mapped Chebyshev expansion sounds very interesting. Does this come down to a Clenshaw-Curtis quadrature to evaluate on a larger grid than the gauss points?

AshtonSBradley avatar Jun 09 '17 23:06 AshtonSBradley

Since Hermite functions are scaled parabolic cylinder functions (see http://dlmf.nist.gov/18.11#E3) and parabolic cylinder functions have a known trigonometric asymptotic expansion (see http://dlmf.nist.gov/12.10#E18), there should be a fast Hermite transform. That is, in the "large parameter" region of the Hermite transform matrix, the matrix can be well-approximated by a sum of nonuniform DCTs and nonuniform DSTs (low-rank in frequency space). With a few weeks of painful work, this could probably be made into a fast Hermite transform. (Lots of details to work out, though.) Of course, to do this one needs to appropriately scale the Hermite polynomials to avoid overflow entirely and to hard-threshold to 0 the entries that underflow. I think an O(n^{3/2}) transform may come from that hard thresholding...
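The scaling point can be seen in a tiny sketch (Python rather than Julia, and `psi` is a hypothetical helper): running the recurrence on the orthonormal Hermite functions ψ_n keeps every intermediate bounded at n = 300, where both the raw H_n(x) and the normalization sqrt(2^n n! sqrt(pi)) would overflow in double precision, while entries far outside the oscillatory region |x| ≲ sqrt(2n+1) underflow to exact zeros, which is precisely what makes hard thresholding attractive:

```python
from math import exp, sqrt, pi

def psi(n, x):
    """Orthonormal Hermite function psi_n(x) = H_n(x) exp(-x^2/2) / sqrt(2^n n! sqrt(pi)),
    evaluated by the scaled three-term recurrence so nothing overflows."""
    p0 = pi**-0.25 * exp(-x * x / 2)
    if n == 0:
        return p0
    p1 = sqrt(2.0) * x * p0
    for k in range(1, n):
        p0, p1 = p1, x * sqrt(2.0 / (k + 1)) * p1 - sqrt(k / (k + 1)) * p0
    return p1

# Bounded inside the classically allowed region |x| < sqrt(2n+1) ≈ 24.5...
v_in = psi(300, 1.0)     # an O(1) oscillatory value
# ...and far outside it the entries underflow to exact 0 and can be thresholded:
v_out = psi(300, 60.0)   # exp(-x^2/2) alone is below realmin here
```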

@MikaelSlevinsky, I imagine the transform matrix is butterflyable too.

In short, a fast Hermite transform seems possible to me.

On and off I have discussed a fast Hermite transform with @daanhb, but there was never enough enthusiasm for me to put in the required effort.

ajt60gaibb avatar Jun 10 '17 01:06 ajt60gaibb

Unfortunately, there seem to be some claims in the paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2630232/#!po=88.5000 that are unsubstantiated.

ajt60gaibb avatar Jun 10 '17 01:06 ajt60gaibb

I think Hermite transforms are useless. What you actually want is a weighted Hermite transform: that is, GaussWeight(Hermite()) in ApproxFun.

The observed stability issues should disappear when you work with weighted Hermite, though you still have to deal with underflow.

dlfivefifty avatar Jun 10 '17 01:06 dlfivefifty

Hermite transforms are fairly clearly nonsense when you think about n -> infinity: the tail will be wrong unless your function is a polynomial, and even if it is a polynomial, round-off error prevents getting the tail exactly right.

This probably also answers the question about Laguerre transforms.

dlfivefifty avatar Jun 10 '17 01:06 dlfivefifty

I guess you are thinking about the interpolation/approximation problem. I was thinking the transform is: "Given a Hermite expansion, evaluate it at some points"...

ajt60gaibb avatar Jun 10 '17 01:06 ajt60gaibb

Ah good point. I suspect that's also useless though.

dlfivefifty avatar Jun 10 '17 02:06 dlfivefifty

👍

ajt60gaibb avatar Jun 10 '17 02:06 ajt60gaibb

Ok interesting, I see I had missed some important parts of ApproxFun. Thanks all for your input.

Regarding the comments above, my function is a sum of Hermite polynomials times exp(-x.^2/2). I have tried using GaussWeight() and it doesn't seem to extend the highest usable order for Hermite, nor to remove the need for BigFloat.

For Laguerre, I can change the call to WeightedLaguerre, but the instability at N=27 seems to persist.

AshtonSBradley avatar Jun 10 '17 02:06 AshtonSBradley

@AshtonSBradley,

> Mapped Chebyshev expansion sounds very interesting. Does this come down to a Clenshaw-Curtis quadrature to evaluate on a larger grid than the gauss points?

By this, I mean using Chebyshev polynomials and an affine variable transformation to work on a sufficiently large interval [a, b].
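One concrete reading of that suggestion, sketched here in Python/NumPy rather than Julia (the interval [-8, 8] and the degree are arbitrary illustration choices): since the target function is Hermite-weighted and decays like exp(-x^2/2), truncating to a finite [a, b] costs almost nothing, and on [a, b] an affinely mapped Chebyshev interpolant converges to machine precision.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# A Hermite-weighted function decays fast, so a large-but-finite [a, b] captures it.
f = lambda x: np.exp(-x**2 / 2)
a, b = -8.0, 8.0                                # f(±8) ≈ 1e-14: negligible truncation

n = 96
t = np.cos(np.pi * (np.arange(n) + 0.5) / n)    # Chebyshev points on [-1, 1]
x = 0.5 * (b - a) * t + 0.5 * (a + b)           # affinely mapped nodes on [a, b]
c = C.chebfit(t, f(x), n - 1)                   # interpolant's Chebyshev coefficients

# Evaluate the mapped expansion anywhere on [a, b] by mapping back to [-1, 1]:
xx = np.linspace(a, b, 1001)
err = np.max(np.abs(C.chebval((2 * xx - (a + b)) / (b - a), c) - f(xx)))
```

The payoff is that the Chebyshev synthesis/analysis pair is a DCT, i.e. genuinely O(N log N), at the price of giving up the Hermite basis itself.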

Unfortunately, I left my laptop at work, but on Monday I will try something along the lines of:

```julia
using ApproxFun, FastTransforms
import FastTransforms: Butterfly

P = # Get the actual matrix from the plan in ApproxFun.plan_transform(Hermite(), rand(256)), say.

BF = Butterfly(P, floor(Int, log2(size(P, 1)) - 6); sketch = :none, atol = 1e-32);

r = FastTransforms.allranks(BF) # just diagnostic info
mean(r), std(r)

x = rand(256);
y = zero(x);
z = zero(x);

@time A_mul_B!(y, P, x);
@time A_mul_B!(z, BF, x);

norm(y - z) # Did it work?
```

MikaelSlevinsky avatar Jun 10 '17 02:06 MikaelSlevinsky

> I think an O(n^{3/2}) transform may come from hard thresholding entries to 0 that underflow...

This would probably be a decent intermediate solution, as it would come down to calling gemv! with an n × O(n^{1/2}) matrix.

MikaelSlevinsky avatar Jun 10 '17 02:06 MikaelSlevinsky

@MikaelSlevinsky, here is a result (it seems to work!):

```julia
using ApproxFun, FastTransforms, BenchmarkTools
import FastTransforms: Butterfly

N = 256
# Get the actual matrix from the plan in ApproxFun.plan_transform(Hermite(), rand(N)).
PlanHerm = ApproxFun.plan_transform(Hermite(), rand(N))
Px, Py = PlanHerm.plan
P = Px * Py'

BF = Butterfly(P, floor(Int, log2(size(P, 1)) - 6); sketch = :none, atol = 1e-32);

r = FastTransforms.allranks(BF) # just diagnostic info
# 4×3 Array{Int64,2}:
#  0  1  1
#  1  1  1
#  1  1  1
#  0  1  1

mean(r), std(r)
# (0.8333333333333334, 0.38924947208076144)

x = rand(N);
y = zero(x);
z = zero(x);

# already ran a warmup
@time A_mul_B!(y, P, x)
#   0.000147 seconds (4 allocations: 160 bytes)

@time A_mul_B!(z, BF, x)
#   0.000016 seconds (4 allocations: 160 bytes)

norm(y - z) # Did it work?
# 2.3778381601473506e-14
```

AshtonSBradley avatar Jun 10 '17 20:06 AshtonSBradley

Is that the actual matrix, though? It seems to me it should have the Vandermonde part to be mathematically full rank:

```julia
PlanHerm = ApproxFun.plan_transform(Hermite(), rand(N));
x, w = PlanHerm.plan;

V = ApproxFun.hermitep(0:N-1, x)';
nrm = (V.^2) * w;
P = Diagonal(inv.(nrm)) * V * Diagonal(w)
```
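The same construction can be sanity-checked in a few lines. A NumPy sketch of the above (with NumPy's `hermvander`/`hermgauss` standing in for `hermitep` and the plan's nodes and weights): the analysis matrix should send the node values of H_3 back to the unit coefficient vector, since Gauss-Hermite quadrature is exact for these products.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss, hermvander

N = 32                                # N = 256 overflows in Float64, as noted below
x, w = hermgauss(N)
V = hermvander(x, N - 1).T            # degrees x points: V[n, j] = H_n(x_j)
nrm = (V**2) @ w                      # ||H_n||^2 = 2^n n! sqrt(pi) under exp(-x^2)
P = (V * w) / nrm[:, None]            # analysis matrix: node values -> coefficients

# Round trip: analyzing H_3 sampled at the nodes recovers the coefficient vector e_3,
# because the quadrature is exact for degree <= 2N - 1.
f = V.T[:, 3]                         # values of H_3 at the Gauss-Hermite nodes
c = P @ f
```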

MikaelSlevinsky avatar Jun 10 '17 20:06 MikaelSlevinsky

ok, it overflows for N=256 (nrm explodes for Hermite). For N=128:

```julia
r = FastTransforms.allranks(BF) # just diagnostic info
# 2×2 Array{Int64,2}:
#  15  20
#  15   0

mean(r), std(r)
# (12.5, 8.660254037844387)

@time A_mul_B!(y, P, x)
#   0.000076 seconds (4 allocations: 160 bytes)

@time A_mul_B!(z, BF, x)
#   0.000017 seconds (4 allocations: 160 bytes)

norm(y - z) # Did it work?
# 4.626203755857375e-16
```

My end goal with this is to apply this optimization to the action of three matrices (one for each spatial direction). I guess I should flatten the 3 matrix product on a 3D array into a single matrix multiply on a vector, and then apply this procedure to it?

AshtonSBradley avatar Jun 10 '17 20:06 AshtonSBradley

Great! So if everything's computed stably, the butterfly algorithm should take O(mean(r)·N^2) time to pre-compute, and O(mean(r)·N log N) to execute.

MikaelSlevinsky avatar Jun 10 '17 20:06 MikaelSlevinsky

I wouldn't be surprised if the butterfly pre-computation did just as well as an O(n^{3/2}) algorithm (for intermediate n) resulting from a lot of tedious analysis, since the latter is mainly built from hard thresholding, which the butterfly can account for anyway.

MikaelSlevinsky avatar Jun 10 '17 20:06 MikaelSlevinsky

Awesome! Any scope for optimising the parameters in Butterfly?

AshtonSBradley avatar Jun 10 '17 20:06 AshtonSBradley

They're almost all passed through to LowRankApprox.jl. So if you look at https://github.com/klho/LowRankApprox.jl/blob/master/src/LowRankApprox.jl#L58-#L101 and the repo's readme, you might be able to decipher what they all mean 😉.

MikaelSlevinsky avatar Jun 10 '17 21:06 MikaelSlevinsky

By the way, this is why I implemented the butterfly algorithm https://arxiv.org/abs/1705.05448

MikaelSlevinsky avatar Jun 10 '17 21:06 MikaelSlevinsky

Thanks for the references! I noticed that if I try to act with BF on a complex vector, I get a MethodError. Should I handle complex variables in a special way, or just transform the real and imaginary parts separately?

AshtonSBradley avatar Jun 10 '17 21:06 AshtonSBradley

Complex methods haven't been implemented yet, so separate transformation is required at the moment.

MikaelSlevinsky avatar Jun 10 '17 22:06 MikaelSlevinsky

I doubt you can do better than applying the transform to real and imaginary parts separately.

Is there a way in Julia to get a view of the real/imag part of a complex number? I think it should be possible by playing around with the stride.
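For comparison, NumPy exposes exactly this kind of strided view (a Python aside, since the question is about Julia): the `.real` attribute of a complex array is a float64 view into the same buffer with a 16-byte stride that skips the imaginary parts, so no copy is made.

```python
import numpy as np

z = np.arange(4, dtype=np.complex128)
re = z.real                  # a strided view into z's buffer, not a copy
# re.strides == (16,): each step skips the adjacent imaginary part
re[0] = 7.0                  # writing through the view mutates z itself
```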

dlfivefifty avatar Jun 11 '17 05:06 dlfivefifty

> My end goal with this is to apply this optimization to the action of three matrices (one for each spatial direction). I guess I should flatten the 3 matrix product on a 3D array into a single matrix multiply on a vector, and then apply this procedure to it?

FastTransforms has an unexported method A_mul_B_col_J!, https://github.com/MikaelSlevinsky/FastTransforms.jl/blob/master/src/SphericalHarmonics/Butterfly.jl#L187, which applies the matrix A to the Jth column of the matrix B and stores the result in the provided output matrix. Looping over all columns, this can accelerate a matrix-matrix product. It would be straightforward to write a method A_mul_B_col_J_slice_K! to accelerate applying the matrix A to every column of a tensor B.

Since Julia uses column-major ordering, it should be just as efficient to "rotate" the data in place twice, each time applying the transform along the other faces of the cube of data. The alternative is to write A_mul_B!-like methods that run through rows and tubes of the data, but this involves accessing data at larger and larger strides, which would be inefficient (or just as costly as rotating the data) and, I think, unnecessarily complicated.
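The rotate-and-transform pattern above can be sketched as follows (in NumPy rather than Julia; `apply_along_axes` is a hypothetical helper, and the cross-check uses a direct contraction over all three axes):

```python
import numpy as np

def apply_along_axes(P, F):
    """Apply the N x N transform P along each axis of the N^3 tensor F by
    rotating the data so the contracted axis is always the leading one."""
    for _ in range(3):
        n = F.shape[0]
        F = P @ F.reshape(n, -1)                   # transform along the leading axis...
        F = F.reshape(n, n, n).transpose(1, 2, 0)  # ...then rotate the axes
    return F

N = 8
P = np.random.randn(N, N)
F = np.random.randn(N, N, N)
G = apply_along_axes(P, F)

# Reference: contract P into all three axes at once.
G_ref = np.einsum('ai,bj,ck,ijk->abc', P, P, P, F)
```

After three rotate-and-transform passes the axes return to their original order, and each pass touches the data contiguously, which is the point of the column-major argument above.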

MikaelSlevinsky avatar Jun 11 '17 21:06 MikaelSlevinsky

I have competing constraints: the advantage is only worthwhile if the transform size is large, but for a large basis I need to run the recurrence on the orthonormal eigenfunctions, not the Hermite polynomials themselves, in order to get there with numerical stability. There is still a two-term recurrence, suggesting there would be something to gain, but when I butterfly the matrix there is very little gain (less than a factor of 2, even for N=512). Is this what you would expect? Does the transform need to be in strict Vandermonde form (maybe it still is, but I haven't looked hard at this), or is it sufficient to have entries related by a two-term recurrence?

AshtonSBradley avatar Jun 12 '17 04:06 AshtonSBradley

It's because the butterfly is an asymptotically fast algorithm. The original authors found a break-even point around a 256 × 256 square matrix, and it appears fully quasi-linear only beyond about 20,000 × 20,000.

It would probably require a specialized algorithm (and a superb implementation) for significant improvement for dimensions up to 512.


MikaelSlevinsky avatar Jun 12 '17 04:06 MikaelSlevinsky

OK, interesting. For N=2048 it becomes a factor of 5; for 2D this is now a significant gain.

AshtonSBradley avatar Jun 12 '17 04:06 AshtonSBradley

P.S. I've made some minor changes (on the master branch) that seem to speed up the butterfly matrix-vector product. My commitment is to make it faster over time!

MikaelSlevinsky avatar Jun 30 '17 18:06 MikaelSlevinsky

@dlfivefifty added a medium-fast Hermite transform in https://github.com/JuliaApproximation/FastTransforms.jl/pull/74.

MikaelSlevinsky avatar Sep 07 '19 23:09 MikaelSlevinsky