Some computations not included in reported total time of output CSVs
Summary:
Stan writes to the output csv file of each chain something like
# Elapsed Time:
# 0.008 seconds (Warm-up)
# 0.004 seconds (Sampling)
# 0.012 seconds (Total)
but this total time doesn't include all computations that have to be done with the model.
Description:
If we look at run_adaptive_sampler, we see that the times warm_delta_t and sample_delta_t measure just the runtime of util::generate_transitions() for warmup and sampling phases, respectively. Then they are written with writer.write_timing(warm_delta_t, sample_delta_t), meaning that the reported total time is their sum. So for example for hmc_static_diag_e_adapt() things that don't get included in the total time are:
A) Finding the initial starting point if not given
B) Calling sampler.init_stepsize() in run_adaptive_sampler()
There are also other things, such as writing the CSV header and reading input, but I listed these two because they both include log probability and gradient evaluations with the model, and therefore I think they should be included in the reported time if it is used to measure model performance.
A) involves log probability and gradient evaluations until a point where they are finite is found (max 100 tries). B) involves gradient evaluations and leapfrog steps inside a while loop until some condition is satisfied (https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/base_hmc.hpp).
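For reference, the structure is roughly the following. This is a simplified paraphrase of run_adaptive_sampler(), not the actual code, and the argument lists are omitted:
// (A) finding the initial point happens earlier, in the service function, and
// (B) step size initialization happens here; neither is part of the reported times
sampler.init_stepsize(logger);
// only this call is timed as "Warm-up"
auto start_warm = std::chrono::steady_clock::now();
util::generate_transitions(/* warmup iterations ... */);
auto end_warm = std::chrono::steady_clock::now();
double warm_delta_t = std::chrono::duration<double>(end_warm - start_warm).count();
// only this call is timed as "Sampling"
auto start_sample = std::chrono::steady_clock::now();
util::generate_transitions(/* sampling iterations ... */);
auto end_sample = std::chrono::steady_clock::now();
double sample_delta_t = std::chrono::duration<double>(end_sample - start_sample).count();
// the reported "Total" is just warm_delta_t + sample_delta_t
writer.write_timing(warm_delta_t, sample_delta_t);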
I found this because I have ODE models where sampler.init_stepsize() can require millions of evaluations of the ODE function (confirmed with profiling) and therefore takes a substantial proportion of the actual wall time. I have sent an example to @rok-cesnovar.
Interfaces:
In cmdstanr, there is also a Sys.time() timing around the entire program call, so that measures the actual wall time correctly. The
$time() method of a fit object reports something like
$total
[1] 210.9202
$chains
chain_id warmup sampling total
1 1 0.007 0.004 0.011
2 2 0.012 0.004 0.016
So the grand total time is correct, but the total time of each chain doesn't include the initialization of parameter values and step size.
Possible solutions
I am not sure what the purpose of B) is and why it is called even if the step size is specified by the user. To me this sounds like a sort of warmup, so it could either be included in the warmup time, or the reported time could have a separate "initialization" part that is included in the total time of the chain.
Current Version:
v2.28.1
@jtimonen. Thanks so much for the careful bug report.
For the timing reports, the only solutions that make sense to me are to (a) include all the time taken, or (b) just not try to report timing. It would be great if we could report log density/gradient evals for setting initial step size.
For (B), the user's initial stepsize is used as a starting point. I believe adaptation has to be turned off to keep the step size from being adapted. It would be nice if we could independently control what is adapted (as in being able to adapt the step size and not the metric, or vice versa).
@bob-carpenter for the niche but important set of models with ODEs, the number of gradient evaluations is (potentially) much less informative than the runtime.
The fix for B) is simple if we decide to add the amount of time we spend in sampler.init_stepsize() as part of warmup. Would that be fine?
A) is probably less critical; ~100 grad evaluations can still be noticeable, but not really that much.
The fix for B) is simple if we decide to add the amount of time we spend in sampler.init_stepsize() as part of warmup
This would make sense, because for example the CmdStan user guide says: "During warmup, the NUTS algorithm adjusts the HMC algorithm parameters metric and stepsize". And the fix would be just to move the start of the warmup clock earlier in run_adaptive_sampler().
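Schematically something like this (a hedged sketch; the exact variable names and call sites in run_adaptive_sampler() may differ):
auto start_warm = std::chrono::steady_clock::now();  // start the warmup clock here ...
sampler.init_stepsize(logger);                       // ... so that the time spent in (B) counts as warmup
util::generate_transitions(/* warmup iterations ... */);
auto end_warm = std::chrono::steady_clock::now();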
Including the time spent on A) requires editing more files. 100 grad evaluations doesn't sound like much, but as pointed out by Niko, for models that involve adaptive iterative solvers (like ODEs) the number of grad evaluations is not the best metric, as the cost of one gradient evaluation is not constant. It depends on the parameter values, and it is not uncommon that the initial evaluations take the longest because difficult parameters are being proposed.
Neither runtime nor the number of grad evaluations is a good measure of ODE solver performance, though I agree runtime is better if we have to pick one. To be more informative in ODE models, we need
- No. of grad evaluations for sampler performance.
- Total runtime.
- No. of RHS evaluations for explicit ODE solvers.
- No. of nonlinear solver iterations for implicit ODE solvers.
I'm totally OK with including total run time and more fine-grained ODE solver reports. I didn't realize until now that they had a lot of extra overhead, but of course that makes sense given that they're doing derivatives and iterative solving with varying numbers of steps internally.
Do the algebraic solvers also have this problem? Even some of our scalar functions have iterative algorithms behind them that can vary in complexity.
Do the algebraic solvers also have this problem?
Yes.
I'd find it great if there were some way to get more metrics about (e.g.) the ODE solution process. But, I think for now just closing the gap between reported runtime and actual wall time would be great!
The original intent of the timing was to capture what was happening during the warmup and main sampling iterations, once the state space and sampler had been initialized. Adding a timer around the initializations shouldn’t be a problem, although it would change the format of the CmdStan CSV output and break analysis code that parses the “Elapsed Time:” section. Another potential issue is how “Total” should be defined: if “Total” is changed to include all three components, then it would technically break backwards compatibility, as performance metrics would change even though the sampler behavior hasn’t changed. The varied expectations people have for what these timings mean, however, suggest that the API guarantees for the timing were never well-defined in the first place.
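For example, one purely hypothetical (and not agreed-upon) extension would be to add an initialization line to the footer and fold it into the total:
# Elapsed Time:
# 0.003 seconds (Initialization)
# 0.008 seconds (Warm-up)
# 0.004 seconds (Sampling)
# 0.015 seconds (Total)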
I am not sure what the purpose of B) is and why it is called even if the step size is specified by the user. To me this sounds like a sort of warmup, so it could either be included in the warmup time, or the reported time could have a separate "initialization" part that is included in the total time of the chain.
(B) initializes the symplectic integrator step size to a reasonable value before warmup iterations commence.
The “step size” argument provides only an initial guess for this procedure, not the final step size. This was implemented in the very first versions of Stan, I believe because so few understood the consequences of a given step size for a given model and even the most informed guesses were often way off; indeed this is largely still the case!
If this stage is taking millions of iterations then there’s a problem. At each iteration the step size is multiplied by 2 or by 1/2; after 50 iterations of halving the step size should already be at the threshold of double precision (starting from 1, fifty halvings give 2^-50 ≈ 9e-16, on the order of the double-precision machine epsilon of about 2.2e-16), and at that point a single step shouldn’t change the system much unless the gradient is extremely poorly behaved. Millions of iterations would imply that something really wrong is going on with the evaluation of the log density at small perturbations away from the initial point.
I’m not against adding a max number of iterations here and erroring out if the limit is hit, to avoid these poor behaviors.
the API guarantees for the timing were never well-defined in the first place.
That's why I've never used those internal timings. The timing I always care about is in my app, which I can time myself externally. Given this, I'd be OK just getting rid of the time reports.
We decided ages ago that unstructured text output isn't subject to backwards compatibility issues. It's not like it was going to reproduce in terms of ms anyway.
If this stage is taking millions of iterations then there’s a problem.
It's not taking millions of those stepsize refinement iterations, just millions of evaluations of the ODE function, as I said. One grad evaluation can call it up to max_num_steps times, which for the ODE solvers is by default 1e6 or 1e9. My point here was just to demonstrate that this stepsize init procedure can sometimes take a substantial amount of time.
EDIT: and by the ODE function I mean the function defining the right-hand side of the system, not the solver function; sorry if that wasn't clear.
That's why I've never used those internal timings. The timing I always care about is in my app, which I can time myself externally. Given this, I'd be OK just getting rid of the time reports.
Yea, the problem is just that some people are using those reports. Like here: https://www.mdpi.com/2624-8611/3/4/48/htm . They say
For JAGS, the warmup run time [jags.model()] and the sampling run time [jags.samples()] were determined by the before-after difference of time stamps obtained from the function Sys.time(). For Stan, the warmup and sampling run times can be obtained from the console output of the function sampling() and are also retrievable from the returned results object (attributes(fitted@sim$samples[[1]])[["elapsed_time"]]
However, in addition to warmup and sampling, Stan’s sampling() function took another 215 seconds on average before returning the samples. Thus, users need to wait much longer for the results of the sampling process when using rstan instead of rjags.
I am not sure where rstan takes those times. Anyway, the additional 215 seconds here is probably some postprocessing.
Further, we encountered a relatively huge consumption of additional time besides warmup and sampling by rstan’s sampling() function. In fact, the additional time was about twice the time for all other steps combined. We are not sure what this function needs this additional time for. According to an anonymous reviewer, the additional time is partly due to calculating ESS and PSR. As this needs to be done for JAGS as well, fair software comparisons would also need to take this into account. Maybe future versions of rstan’s sampling() function may include an option to return the samples directly after sampling or explicitly report the time needed for additional calculations of convergence and precision statistics.
I would really like to know what the additional time here comes from. It seems to be a different additional time sink than what I originally started this thread for, but highlights that you need to understand quite a lot about the internals of Stan or your Stan interface to measure performance. If you want to just analyze the MCMC algorithm and compute ESS/second then I think that timing should be done so that it
- includes the initial stepsize refinement, other warmup and sampling, but
- does not include reading in data, or postprocessing computation of metrics like rhat.
Even then the performance measure will be affected a bit by whether you are writing the draws as output to CSV (like cmdstan) or storing them in memory (like rstan).
That's why I've never used those internal timings. The timing I always care about is in my app, which I can time myself externally. Given this, I'd be OK just getting rid of the time reports.
One other thing is that if all the timing reports are removed, and one can only time the grand total time on their own, then it gets quite difficult to get information about how much time different chains took.
We decided ages ago that unstructured text output isn't subject to backwards compatibility issues.
CmdStanR is grepping that unstructured text and making structured output out of it: https://github.com/stan-dev/cmdstanr/blob/master/R/run.R#:~:text=if%20(state%20%3C%203%20%26%26%20grepl(%22Elapsed,%7D
So that is going to be a problem if it goes into CRAN as such and the output text of Stan CSVs changes.
We will just remove that if we decide to remove it from the services and go back to reporting total times per chain as it was originally; that is not a big issue. I don't think we want to remove these timings though, and we can maybe find a way to include them or report them separately.
affected a bit by whether you are writing the draws as output to CSV (like cmdstan)
The performance hit from the CSV output is a bit exaggerated in general in my opinion. On my 6-year-old system, outputting 1000 samples of 10k parameters takes 4s. So that means that the total performance penalty for the entire model with num_samples=1000 is 4s. With 50k parameters it is 20s. I think 4 seconds for a model with 10k parameters falls within measurement noise more often than not, at least for non-trivial models.
We will just remove that if we decide to remove it from the services and go back to reporting total times per chain as it was originally; that is not a big issue.
But if, say, CmdStanR 1.0 is released soon to CRAN and the public method fit$time() returns a list with total and chains (where chains has the warmup and sampling times) like it does now, and then a new Stan version comes which doesn't give the warmup and sampling times separately anymore, wouldn't CmdStanR technically have to go to 2.0 because it breaks compatibility? I actually don't know at all how things work if you have a CRAN package that has external system requirements.
The performance hit from the CSV output is a bit exaggerated in general in my opinion.
I don't think it is a big deal either currently, but I'm just trying to list things that are not part of the MCMC algorithm itself and that therefore, depending on the implementation / libraries used, could in theory have an effect on performance metrics.
It's not taking millions of those stepsize refinement iterations, just millions of evaluations of the ODE function, as I said. One grad evaluation can call it up to max_num_steps times, which for the ODE solvers is by default 1e6 or 1e9. My point here was just to demonstrate that this stepsize init procedure can sometimes take a substantial amount of time.
But what do you mean by “substantial” here? The state and sampler initialization require maybe hundreds of gradient evaluations for particularly difficult problems, but then that cost should be overwhelmed by the gradient evaluations needed for generating transitions during warmup and sampling. If one runs really short chains then this initialization can make a difference, but those really short chains are also not recommended best practice. In other words, for default sampler settings the warmup and sampling timings capture the bulk of the total cost, and for more non-standard settings other time quantifications (wrapping everything in a total time, using the profiler) can be used to identify other contributions.
The subtlety that ODE solvers, and other numerical solvers, introduce is the potential variation in the gradient evaluation time based on how ill-posed the ODE system is at each point. If the Markov chains are initialized in neighborhoods that are extremely ill-posed but the typical set contains much better-posed neighborhoods then the initialization cost can be much more substantial.
To be clear, I’m not against adding more timing information, although I also believe that that information can be deduced from other facilities that are already available. My main point here is that the warmup and sampling timings were always intended to capture only the time cost of running the Markov chains. Tweaking the output to be more precise, for example replacing “Warmup”, “Sampling”, and “Total” with “Warmup Iterations”, “Main Sampling Iterations”, and “Total Iteration Time” or the like, might also be helpful.
That's why I've never used those internal timings. The timing I always care about is in my app, which I can time myself externally. Given this, I'd be OK just getting rid of the time reports.
One other thing is that if all the timing reports are removed, and one can only time the grand total time on their own, then it gets quite difficult to get information about how much time different chains took.
The sampler timings have always been intended for understanding sampler behavior and not overall algorithmic cost. Looking for variation between chains is perhaps their most useful application, but looking for timing differences between iterations of model optimization and across computers can also be really useful for gauging the overall change in sampler behavior. Also, it’s these timings that matter for the ESS / time metric, which quantifies sampler performance and not the performance of the entire library.
I am not sure where rstan takes those times.
I would really like to know what the additional time here comes from. It seems to be a different additional time sink than what I originally started this thread for, but highlights that you need to understand quite a lot about the internals of Stan or your Stan interface to measure performance. If you want to just analyze the MCMC algorithm and compute ESS/second then I think that timing should be done so that it
There are two issues, both exacerbated by the number of variables being captured in the output.
The actual packaging of the output into the stanfit object itself can be costly; I’m not exactly sure what the main bottleneck is (presumably there’s some transposing/rearranging being done that might cause the memory to start paging) but this has always been around. The delay becomes noticeable the more variables there are.
Secondly, at some point RStan started running automated diagnostics, which initially included Rhat and ESS for each output variable but now include all of the modified Rhat and ESS calculations. Once there are many output variables these costs add up, especially for the modified diagnostics, which are very, very expensive (PyStan only checks the standard Rhat and ESS but also caps the number of variables automatically analyzed to 1000, which can help, although the cap is pretty arbitrary). Unfortunately there’s no way to turn them off. Honestly this has been the most frustrating part about using RStan for nontrivial models over the last few versions.
We decided ages ago that unstructured text output isn't subject to backwards compatibility issues. It's not like it was going to reproduce in terms of ms anyway.
But if, say, CmdStanR 1.0 is released soon to CRAN and the public method fit$time() returns a list with total and chains (where chains has the warmup and sampling times) like it does now, and then a new Stan version comes which doesn't give the warmup and sampling times separately anymore, wouldn't CmdStanR technically have to go to 2.0 because it breaks compatibility? I actually don't know at all how things work if you have a CRAN package that has external system requirements.
The problem with previous decisions about the CmdStan API guarantees is that CmdStanPy and CmdStanR have largely ignored them, which has baked in multiple dependencies on non-guaranteed CmdStan behavior. This is going to cause problems eventually.
But what do you mean by “substantial” here? The state and sampler initialization require maybe hundreds of gradient evaluations for particularly difficult problems, but then that cost should be overwhelmed by the gradient evaluations needed for generating transitions during warmup and sampling.
The subtlety that ODE solvers, and other numerical solvers, introduce is the potential variation in the gradient evaluation time based on how ill-posed the ODE system is at each point. If the Markov chains are initialized in neighborhoods that are extremely ill-posed but the typical set contains much better-posed neighborhoods then the initialization cost can be much more substantial.
I've had cases where initialization takes something like 10-25% of total time with 2000 iterations. It has happened even when setting init=0 for a system where that point shouldn't be particularly ill-posed. The definition of the neighborhood of the given initial parameters is quite fuzzy, and this init_stepsize() function has something like
this->nom_epsilon_ = direction == 1 ? 2.0 * this->nom_epsilon_
                                    : 0.5 * this->nom_epsilon_;
which in my understanding means it can evolve the integrator repeatedly, each time doubling (or halving) the nominal stepsize, so it might go pretty far from the original point to some ill-posed region (I don't totally understand how direction is determined or the motivation for it).
So yes, bad initial points are a problem. But I am kind of suspecting that even with a well-posed initial point and priors that have significant mass only on a well-posed region, you cannot guarantee that you'd only ever need to solve ODEs using well-posed parameters, if the system is ill-posed with some parameter combinations.
Oh and in that case I was able to reduce the time required by init_stepsize() from that 10%-25% to a very minimal amount by setting step_size=0.1 instead of the default step_size=1.
I've had cases where initialization takes something like 10-25% of total time with 2000 iterations. It has happened even when setting init=0 for a system where that point shouldn't be particularly ill-posed. The definition of the neighborhood of the given initial parameters is however quite fuzzy, and this init_stepsize() (https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/base_hmc.hpp) function has something like
this->nom_epsilon_ = direction == 1 ? 2.0 * this->nom_epsilon_
                                    : 0.5 * this->nom_epsilon_;
which in my understanding means it can evolve the integrator repeatedly, each time doubling (or halving) the nominal stepsize, so it might go pretty far from the original point to some ill-posed region (I don't totally understand how direction is determined or the motivation for it).
direction is set on line 104.
Once an initial point with finite log density and log density gradient has been found, the step size is initialized to find a reasonable starting value where the change in Hamiltonian after one step is around log(0.8). This corresponds to a Metropolis acceptance probability of about 0.8 for transitioning to that next point.
To do this, momenta p are sampled at the initial point q to give an initial point in phase space, z = (q, p). This initial phase space point is then evolved forward for one step with the symplectic integrator using the initial step size. At this new point the Hamiltonian is evaluated; if the change in Hamiltonian is smaller than log(0.8) then the initial stepsize is too large and needs to be reduced, but if the change in Hamiltonian is larger than log(0.8) then the initial stepsize is too small and needs to be increased. To find the right increase/decrease, the stepsize is doubled or halved repeatedly until the change in Hamiltonian after one step of the symplectic integrator from the initial point drops below/above log(0.8).
In other words, the log density is only ever evaluated one step away from the initial point. That said, if the sampled momenta are large, if the gradient at the initial point is large, or if the region of stability of an iterative solver used inside the log density calculation is very narrow around the initial point, then those test evolutions might require expensive solves.
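For concreteness, the logic is roughly the following; this is a simplified paraphrase of init_stepsize() in base_hmc.hpp, not the exact implementation:
double H0 = this->hamiltonian_.H(this->z_);         // Hamiltonian at the initial phase space point
this->integrator_.evolve(this->z_, this->hamiltonian_,
                         this->nom_epsilon_, logger);  // one symplectic integrator step
double h = this->hamiltonian_.H(this->z_);
double delta_H = H0 - h;
int direction = delta_H > std::log(0.8) ? 1 : -1;   // decide whether to grow or shrink the step size
while (1) {
  // restore the initial phase space point, resample the momenta, take one integrator step
  // with the current nom_epsilon_, and recompute delta_H; each pass therefore costs one
  // log density/gradient evaluation, i.e. one potentially expensive solve
  if ((direction == 1) && !(delta_H > std::log(0.8)))
    break;   // doubling went far enough: stop
  else if ((direction == -1) && !(delta_H < std::log(0.8)))
    break;   // halving went far enough: stop
  else
    this->nom_epsilon_ = direction == 1 ? 2.0 * this->nom_epsilon_
                                        : 0.5 * this->nom_epsilon_;
}
Each pass through the loop takes exactly one integrator step from the initial point, so the cost of this stage is driven by how expensive a single log density/gradient evaluation is near that point.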
So yes, bad initial points are a problem. But I am kind of suspecting that even with a well-posed initial point and priors that have significant mass only on a well-posed region, you cannot guarantee that you'd only ever need to solve ODEs using well-posed parameters, if the system is ill-posed with some parameter combinations.
Yeah, unless you can guarantee that the region connecting the initial point and the posterior typical set is well-posed for an iterative solver, you might have to tackle expensive solves during warmup. If the solver is well-posed within the extent of the prior, and the posterior contracts within that extent, then initializing from the prior should be sufficient. Stan’s default initialization, however, may not make this possible.
The fact that you’re seeing such expensive solves after only a step suggests that the well-posed region is either very small or intricately shaped.
Oh and in that case I was able to reduce the time required by init_stepsize() from that 10%-25% to a very minimal amount by setting step_size=0.1 instead of the default step_size=1.
Setting step_size=0.1 instead of step_size=1 changes the initial stepsize in the above procedure. That said, the adaptation procedure would still need only 4 iterations to drop an initial step size of 1 to one below 0.1; in other words, starting with 0.1 would save only four evaluations of the model. That suggests that the bulk of the expense is coming from just a few iterative solves!
Note that this example demonstrates why this stepsize initialization is so useful. If the sampler started with stepsize=1 then it would evaluate the model multiple times along the initial Hamiltonian transition, which would compound the expense if a single solve is so expensive. By heuristically adapting the step size first with single steps, the cost of the initial trajectories becomes much more manageable.
Anyways the conversation has drifted a bit from the original issue. The dynamic cost of adaptive solvers, all of which were introduced after the initial design of the interfaces, can cause the state and step size initialization routines to be a nontrivial contribution to the overall runtime. So long as all of the dependent interfaces adapt accordingly I don’t have any problem with adding a time estimate for the initialization procedure as well.
Thank you very much for the explanations!