Blocking vs async performance
Hi! I recently ran a small test to measure performance and got very surprising results (on the aurora-merged mesh):
```
Average time per 1000 runs per 34707 verts: 336ms 231us 70ns
Average time per 1000 runs per 1000 verts: 9ms 889us 149ns
Average time per 1000 runs per 34707 verts (async): 217us 79ns
Average time per 1000 runs per 1000 verts (async): 6us 384ns
```
Can someone explain whether I did something wrong (see code below), or how the async version is so much faster? It doesn't look multithreaded, so how?
```rust
#![feature(core_intrinsics)]
use std::sync::Arc;
use glam::Vec2;
use humantime::format_duration;
use polyanya::{Mesh, PolyanyaFile};
use tokio::sync::RwLock;
fn run_and_measure<F: FnOnce() -> R, R>(f: F) -> (R, std::time::Duration) {
let start = std::time::Instant::now();
let result = f();
let elapsed = start.elapsed();
(result, elapsed)
}
#[tokio::main]
async fn main() {
let mesh: Mesh = PolyanyaFile::from_file("meshes/aurora-merged.mesh").into();
let mesh_min_x = mesh
.vertices
.iter()
.map(|v| v.coords.x)
.fold(f32::INFINITY, |a, b| a.min(b));
let mesh_max_x = mesh
.vertices
.iter()
.map(|v| v.coords.x)
.fold(f32::NEG_INFINITY, |a, b| a.max(b));
let mesh_min_y = mesh
.vertices
.iter()
.map(|v| v.coords.y)
.fold(f32::INFINITY, |a, b| a.min(b));
let mesh_max_y = mesh
.vertices
.iter()
.map(|v| v.coords.y)
.fold(f32::NEG_INFINITY, |a, b| a.max(b));
let width = mesh_max_x - mesh_min_x;
let height = mesh_max_y - mesh_min_y;
let runs = 10_000;
let duration = run_and_measure(|| {
for _ in 0..runs {
let start_x = rand::random::<f32>() * width + mesh_min_x;
let start_y = rand::random::<f32>() * height + mesh_min_y;
let end_x = rand::random::<f32>() * width + mesh_min_x;
let end_y = rand::random::<f32>() * height + mesh_min_y;
std::intrinsics::black_box(
mesh.path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)),
);
}
});
let duration_per_1000_runs = duration.1 / (runs / 1_000);
let vertices = mesh.vertices.len();
println!(
"Average time per 1000 runs per {} verts: {}",
vertices,
format_duration(duration_per_1000_runs)
);
let duration_per_1000_runs_per_1000_verts = duration_per_1000_runs / (vertices as u32 / 1_000);
println!(
"Average time per 1000 runs per 1000 verts: {}",
format_duration(duration_per_1000_runs_per_1000_verts)
);
// Now try the async version
let runs = 1_000_000;
let rw_lock_mesh = Arc::new(RwLock::new(mesh));
let duration = run_and_measure(|| {
let mut futures = vec![];
for _ in 0..runs {
let start_x = rand::random::<f32>() * width + mesh_min_x;
let start_y = rand::random::<f32>() * height + mesh_min_y;
let end_x = rand::random::<f32>() * width + mesh_min_x;
let end_y = rand::random::<f32>() * height + mesh_min_y;
let rw_lock_mesh = rw_lock_mesh.clone();
futures.push(async move {
let mesh = rw_lock_mesh.read().await;
std::intrinsics::black_box(
mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y))
.await,
);
});
}
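// Bug (found later in this thread): the join_all future below is returned
// from the synchronous closure but never awaited, so the paths are never
// actually computed; only future *creation* is being timed.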
futures::future::join_all(futures)
});
let duration_per_1000_runs = duration.1 / (runs / 1_000);
println!(
"Average time per 1000 runs per {} verts (async): {}",
vertices,
format_duration(duration_per_1000_runs)
);
let duration_per_1000_runs_per_1000_verts = duration_per_1000_runs / (vertices as u32 / 1_000);
println!(
"Average time per 1000 runs per 1000 verts (async): {}",
format_duration(duration_per_1000_runs_per_1000_verts)
);
}
```
Thanks in advance for any answers! Very interested in the details :P
I tried it in a benchmark using criterion:

```
path sync time: [326.31 µs 335.50 µs 344.31 µs]
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
path async futures time: [327.87 µs 337.28 µs 346.35 µs]
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
path async smol time: [331.01 µs 362.15 µs 393.84 µs]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
path async tokio time: [319.19 µs 328.30 µs 337.53 µs]
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
```
Source for the bench:

```rust
use criterion::async_executor::{FuturesExecutor, SmolExecutor};
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use glam::Vec2;
use polyanya::{Mesh, PolyanyaFile};
use rand::{rngs::StdRng, Rng, SeedableRng};
fn path_sync_vs_async(c: &mut Criterion) {
let mesh: Mesh = PolyanyaFile::from_file("meshes/aurora-merged.mesh").into();
let mesh_min_x = mesh
.vertices
.iter()
.map(|v| v.coords.x)
.fold(f32::INFINITY, |a, b| a.min(b));
let mesh_max_x = mesh
.vertices
.iter()
.map(|v| v.coords.x)
.fold(f32::NEG_INFINITY, |a, b| a.max(b));
let mesh_min_y = mesh
.vertices
.iter()
.map(|v| v.coords.y)
.fold(f32::INFINITY, |a, b| a.min(b));
let mesh_max_y = mesh
.vertices
.iter()
.map(|v| v.coords.y)
.fold(f32::NEG_INFINITY, |a, b| a.max(b));
let width = mesh_max_x - mesh_min_x;
let height = mesh_max_y - mesh_min_y;
let mut random = StdRng::from_seed([0; 32]);
c.bench_function(&"path sync".to_string(), |b| {
b.iter(|| {
let start_x = random.gen::<f32>() * width + mesh_min_x;
let start_y = random.gen::<f32>() * height + mesh_min_y;
let end_x = random.gen::<f32>() * width + mesh_min_x;
let end_y = random.gen::<f32>() * height + mesh_min_y;
black_box(mesh.path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
})
});
// recreate random from the same seed
let mut random = StdRng::from_seed([0; 32]);
c.bench_function(&"path async futures".to_string(), |b| {
b.to_async(FuturesExecutor).iter(|| {
let start_x = random.gen::<f32>() * width + mesh_min_x;
let start_y = random.gen::<f32>() * height + mesh_min_y;
let end_x = random.gen::<f32>() * width + mesh_min_x;
let end_y = random.gen::<f32>() * height + mesh_min_y;
black_box(mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
})
});
// recreate random from the same seed
let mut random = StdRng::from_seed([0; 32]);
c.bench_function(&"path async smol".to_string(), |b| {
b.to_async(SmolExecutor).iter(|| {
let start_x = random.gen::<f32>() * width + mesh_min_x;
let start_y = random.gen::<f32>() * height + mesh_min_y;
let end_x = random.gen::<f32>() * width + mesh_min_x;
let end_y = random.gen::<f32>() * height + mesh_min_y;
black_box(mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
})
});
// recreate random from the same seed
let mut random = StdRng::from_seed([0; 32]);
c.bench_function(&"path async tokio".to_string(), |b| {
let rt = tokio::runtime::Runtime::new().unwrap();
b.to_async(rt).iter(|| {
let start_x = random.gen::<f32>() * width + mesh_min_x;
let start_y = random.gen::<f32>() * height + mesh_min_y;
let end_x = random.gen::<f32>() * width + mesh_min_x;
let end_y = random.gen::<f32>() * height + mesh_min_y;
black_box(mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
})
});
}
criterion_group!(benches, path_sync_vs_async);
criterion_main!(benches);
```
Sync and async with the futures crate seem to have the same kind of performance; it's a tiny bit slower with smol, and a tiny bit faster with tokio.
I don't have a good explanation; maybe tokio is better at small tasks?
My bad! I wasn't awaiting the join_all properly in my test, so the async version only measured creating the futures, not actually running them! I would consider the issue resolved 👍🏻
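For reference, here is a minimal self-contained sketch of the pitfall (using placeholder futures instead of polyanya calls) showing the difference between building the join_all future and actually awaiting it:

```rust
use std::time::Instant;

#[tokio::main]
async fn main() {
    // Stand-ins for the pathfinding futures from the original test.
    let futures: Vec<_> = (0..1_000_000u64)
        .map(|i| async move { i.wrapping_mul(2) })
        .collect();

    // What the broken test measured: join_all only *builds* a combined
    // future here; nothing is polled yet, so this is almost free.
    let start = Instant::now();
    let joined = futures::future::join_all(futures);
    println!("building join_all: {:?}", start.elapsed());

    // The fix: awaiting actually drives every future to completion.
    let start = Instant::now();
    let results = joined.await;
    println!("awaiting join_all: {:?} ({} results)", start.elapsed(), results.len());
}
```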
One last question I have though: what's the idea behind the async endpoint? Is it better than just quickly creating a task and calculating a path inside? 🤔 I think it's the first time I see an async endpoint in any kind of pathfinding library, so I am curious!
> One last question I have though: what's the idea behind the async endpoint? Is it better than just quickly creating a task and calculating a path inside?
The async endpoint returns Poll::Pending periodically, which allows the task to be interrupted at the async-scheduler level as well as at the OS-thread level, enabling cooperative multitasking.
Wrapping a normal blocking function in a task doesn't allow interrupting it while it's running.
As I'm using Polyanya for video games, blocking should be avoided. In Bevy I still use the blocking version in an async task, but that means I could be limited in the number of pathfinding tasks that can run in parallel.
That said, it's mostly a theoretical point, as I haven't had paths that took long enough to compute to actually block other tasks. But the pathfinding algorithm has a clear "step" definition that allows handing back control between steps, and I like async Rust, so I wanted to try exposing an API with it.
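To illustrate, here is a minimal hand-rolled sketch of that pattern (an illustration only, not polyanya's actual implementation): a future that does one bounded "step" of work per poll and returns Poll::Pending in between, so the executor can interleave other tasks.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Hypothetical step-based search exposed as a future.
struct StepwiseSearch {
    steps_remaining: u32,
}

impl Future for StepwiseSearch {
    type Output = u32; // stand-in for a computed path

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.steps_remaining == 0 {
            return Poll::Ready(42);
        }
        // ...do one bounded unit of work here (e.g. expand one search node)...
        self.steps_remaining -= 1;
        // Ask to be polled again, then yield so other tasks can run.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}
```

Every Poll::Pending is a point where the scheduler can run something else; a blocking call has no such points.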
When running many pathfinding tasks in parallel:
- with the blocking API, they'll wait longer to start, but finish faster once started
- with the async API, they'll start sooner, but finish slower
The async API could also be friendlier to other tasks if you have a lot of other things running in parallel.
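For what it's worth, here is a hedged sketch of what the two styles can look like under tokio, using the two entry points from the code above (the blocking mesh.path and the async mesh.get_path) and assuming Mesh is Send + Sync; the coordinates are arbitrary:

```rust
use std::sync::Arc;
use glam::Vec2;
use polyanya::{Mesh, PolyanyaFile};

#[tokio::main]
async fn main() {
    let mesh: Mesh = PolyanyaFile::from_file("meshes/aurora-merged.mesh").into();
    let mesh = Arc::new(mesh);
    // Arbitrary illustrative points.
    let (from, to) = (Vec2::new(1.0, 1.0), Vec2::new(100.0, 100.0));

    // Blocking API: each search occupies a blocking-pool thread until it
    // finishes, so the pool size caps how many searches progress at once.
    let m = mesh.clone();
    let blocking = tokio::task::spawn_blocking(move || m.path(from, to));

    // Async API: the search yields periodically, so many searches can be
    // interleaved on the same worker threads and all start almost immediately.
    let m = mesh.clone();
    let cooperative = tokio::spawn(async move { m.get_path(from, to).await });

    let _ = (blocking.await, cooperative.await);
}
```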