Synthesized programs for agent benchmarking
Summary
For consistent results in the agent benchmarking project, we need to generate synthetic programs. By that I mean programs that, when profiled, produce a profile with predetermined characteristics.
For example, a program that looks like this:
func work(n int) {
    for i := 0; i < n; i++ {
    }
}

func fastFunction() {
    work(2000)
}

func slowFunction() {
    work(8000)
}

func main() {
    for {
        fastFunction()
        slowFunction()
    }
}
Would generate a profile that roughly looks like this:
main;fastFunction;work 200
main;slowFunction;work 800
or, visualized:

[flamegraph of the profile above]
Proposed Solution
We could write code that would generate synthetic programs for a given profile.
Consider a Generator interface:
type Generator interface {
    // Function registers a function definition that performs the given
    // amount of work and then calls the listed child functions.
    Function(name string, work int, children []string)
    // Program returns a string representation of the generated program.
    Program() string
}
And an implementation of the Generator interface for Ruby:
import (
    "fmt"
    "strings"
)

type RubyGenerator struct {
    functions []string
}

func (r *RubyGenerator) Function(name string, work int, children []string) {
    // The busy loop must increment i, otherwise the generated Ruby
    // would spin forever for any work > 0.
    r.functions = append(r.functions, fmt.Sprintf(`
def %s
  i = 0
  while i < %d
    i += 1
  end
  %s
end
`, name, work, strings.Join(children, "\n  ")))
}
func (r *RubyGenerator) Program() string {
    return fmt.Sprintf(`
%s
while true
  main()
end
`, strings.Join(r.functions, "\n"))
}
Then, if you use this interface like this:
var g Generator
g = &RubyGenerator{}
g.Function("main", 0, []string{"fast", "slow"})
g.Function("fast", 200, []string{})
g.Function("slow", 800, []string{})
fmt.Println(g.Program())
It generates valid (albeit ugly) Ruby code. We could have these generators for other languages (Go, Python), and this way we could generate synthetic programs for agent benchmarking.
Here's a link to a Go Playground with a working Ruby example: https://play.golang.org/p/RvgamhyNX9p
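For illustration, here's a minimal sketch of what a Go implementation of the same interface might look like. It assumes the same fmt/strings imports as above and prefixes generated function names with fn_ so that the generated main function doesn't collide with Go's entry point; a real implementation would also need to keep the empty busy loop from being optimized away (e.g. by accumulating into a package-level variable).

// GoGenerator is a hypothetical Generator implementation that emits Go source.
type GoGenerator struct {
    functions []string
}

func (g *GoGenerator) Function(name string, work int, children []string) {
    calls := ""
    for _, c := range children {
        calls += "    fn_" + c + "()\n"
    }
    // Each generated function burns CPU in a counting loop, then calls its children.
    g.functions = append(g.functions, fmt.Sprintf(
        "func fn_%s() {\n    for i := 0; i < %d; i++ {\n    }\n%s}", name, work, calls))
}

func (g *GoGenerator) Program() string {
    // Wrap the generated functions in a package whose entry point loops
    // forever, mirroring the Ruby version's `while true; main(); end`.
    return fmt.Sprintf("package main\n\n%s\n\nfunc main() {\n    for {\n        fn_main()\n    }\n}\n",
        strings.Join(g.functions, "\n\n"))
}

The same g.Function calls from the Ruby example would then produce a compilable Go program.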
Generating Realistic Profiles
The solution proposed above is the first step. The next step is generating realistic profiles. We could generate profiles from some initial parameters, as we do in loadgen, but we've found that approach a bit too naive to replicate the profiles found in real-world applications.
So instead I propose that we generate these programs from existing profiles stored in the Pyroscope database.
To make that work, we could modify the /render endpoint to accept a custom format parameter. If this parameter is set to something like ruby, we would generate a synthetic program based on the profile. Here's the relevant line of code that currently generates the profile in flamebearer format: https://github.com/pyroscope-io/pyroscope/blob/integrations-move/pkg/server/render.go#L64
We could replace it with something like this:
g := &RubyGenerator{}
out.Tree.GenerateProgram(g)
where GenerateProgram would be a new method on Tree, similar to the existing Iterate method, that calls g.Function for each node in the tree.
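As a rough sketch, assuming a simplified tree with a root field and nodes carrying Name, Self, and ChildrenNodes (the actual Tree structure differs, and a real version would also need to handle the same function name appearing under different parents), it could look like this:

func (t *Tree) GenerateProgram(g Generator) {
    var walk func(n *treeNode)
    walk = func(n *treeNode) {
        children := make([]string, 0, len(n.ChildrenNodes))
        for _, c := range n.ChildrenNodes {
            children = append(children, string(c.Name))
            walk(c)
        }
        // Self is the sample count attributed to this node itself,
        // which maps directly to Generator.Function's work parameter.
        g.Function(string(n.Name), int(n.Self), children)
    }
    walk(t.root)
}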
This would enable us to generate realistic programs for any language to be used in benchmarking.
Additional Considerations
One thing I haven't covered here is the ability to simulate different levels of CPU utilization: the current implementation always simulates 100% utilization. I'm thinking we could extend the Generator interface with some kind of Wait function that simulates waiting for a given amount of time; this would let us simulate different utilization levels.
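For illustration, the extension might look something like this; Wait and the sleep-based Ruby emission below are hypothetical, not part of any existing code:

type Generator interface {
    Function(name string, work int, children []string)
    // Wait registers a function that sleeps for roughly ms milliseconds
    // instead of burning CPU; it can then be listed as a child like any
    // other function.
    Wait(name string, ms int)
    Program() string
}

func (r *RubyGenerator) Wait(name string, ms int) {
    r.functions = append(r.functions, fmt.Sprintf(`
def %s
  sleep(%f)
end
`, name, float64(ms)/1000.0))
}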
I'm looking at some edge cases to better understand the domain.
Simulating different levels of CPU utilization
What are the requirements for simulating different levels of CPU utilization? Or, more specifically, does it need to be precise?
If we want to precisely simulate a given CPU utilization level, we'd need to use clocks, not only for the waits but also for the workloads. I also wonder whether clock precision in various runtimes is a factor. Looking at Pyroscope's pyroscope.server.cpu flamegraph, the resolution is 0.01 seconds, which is probably trivial to achieve in any environment.
If we just want to simulate roughly defined CPU utilization levels like not-full and full, mixing pure incrementation loops (like in your examples) with some time-based sleeps would probably be fine.
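To make the rough version concrete, here's a sketch of the duty-cycle idea in Go (names and numbers are illustrative): alternate a clock-bounded busy period with a sleep so that, over each window, the process uses roughly the target fraction of one core.

import "time"

// burnDutyCycle busy-waits for target*window, then sleeps for the remainder,
// repeating forever. For example, burnDutyCycle(0.3, 100*time.Millisecond)
// targets roughly 30% utilization of one core.
func burnDutyCycle(target float64, window time.Duration) {
    busy := time.Duration(float64(window) * target)
    for {
        deadline := time.Now().Add(busy)
        for time.Now().Before(deadline) {
            // spin: consume CPU until the busy share of the window is used up
        }
        time.Sleep(window - busy)
    }
}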
Other profiling types
Presuming that the same mechanism is likely to be extended to support other profiling types (like allocations or object counts), can you think of any case that would require a fundamentally different approach? Any such case would be useful to know about when working out the factoring.
Allocations, for example, might be a straightforward extension: each node would take turns looping, and each loop iteration would allocate the appropriate proportion of space or objects.
Though how that translates into a profile might change depending on the runtime. For example, the Java integration extends async-profiler, which (for allocations) only looks at certain TLAB-related events; in real applications this roughly translates to the largest allocation sources, but not all allocations. I'm guessing this isn't a big concern, though.
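To illustrate the allocation idea above, a hypothetical RubyGenerator method could look like this (the method name and the 64-byte string size are arbitrary choices for the sketch):

func (r *RubyGenerator) AllocFunction(name string, objects int, children []string) {
    // Instead of spinning the CPU, each pass allocates `objects` small strings.
    r.functions = append(r.functions, fmt.Sprintf(`
def %s
  objs = []
  %d.times { objs << "x" * 64 }
  %s
end
`, name, objects, strings.Join(children, "\n  ")))
}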
Simulating different levels of CPU utilization
It doesn't have to be too precise; I'd say even 20% variance is tolerable here.
Other profiling types
Simulating memory allocations is an interesting idea. I'm thinking this might become relevant in the future, but right now CPU utilization is much more important.
See #519 for a partially implemented version of this.