Blocking vs async performance
Hi! I recently ran a small test to measure performance and got very surprising results (on the aurora-merged mesh):
```
Average time per 1000 runs per 34707 verts: 336ms 231us 70ns
Average time per 1000 runs per 1000 verts: 9ms 889us 149ns
Average time per 1000 runs per 34707 verts (async): 217us 79ns
Average time per 1000 runs per 1000 verts (async): 6us 384ns
```
Can someone explain whether I did something wrong (see code below), or how the async version is so much faster? It doesn't look multithreaded, so how?
```rust
#![feature(core_intrinsics)]
use std::sync::Arc;
use glam::Vec2;
use humantime::format_duration;
use polyanya::{Mesh, PolyanyaFile};
use tokio::sync::RwLock;
fn run_and_measure<F: FnOnce() -> R, R>(f: F) -> (R, std::time::Duration) {
let start = std::time::Instant::now();
let result = f();
let elapsed = start.elapsed();
(result, elapsed)
}
#[tokio::main]
async fn main() {
let mesh: Mesh = PolyanyaFile::from_file("meshes/aurora-merged.mesh").into();
let mesh_min_x = mesh
.vertices
.iter()
.map(|v| v.coords.x)
.fold(f32::INFINITY, |a, b| a.min(b));
let mesh_max_x = mesh
.vertices
.iter()
.map(|v| v.coords.x)
.fold(f32::NEG_INFINITY, |a, b| a.max(b));
let mesh_min_y = mesh
.vertices
.iter()
.map(|v| v.coords.y)
.fold(f32::INFINITY, |a, b| a.min(b));
let mesh_max_y = mesh
.vertices
.iter()
.map(|v| v.coords.y)
.fold(f32::NEG_INFINITY, |a, b| a.max(b));
let width = mesh_max_x - mesh_min_x;
let height = mesh_max_y - mesh_min_y;
let runs = 10_000;
let duration = run_and_measure(|| {
for _ in 0..runs {
let start_x = rand::random::<f32>() * width + mesh_min_x;
let start_y = rand::random::<f32>() * height + mesh_min_y;
let end_x = rand::random::<f32>() * width + mesh_min_x;
let end_y = rand::random::<f32>() * height + mesh_min_y;
std::intrinsics::black_box(
mesh.path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)),
);
}
});
let duration_per_1000_runs = duration.1 / (runs / 1_000);
let vertices = mesh.vertices.len();
println!(
"Average time per 1000 runs per {} verts: {}",
vertices,
format_duration(duration_per_1000_runs)
);
let duration_per_1000_runs_per_1000_verts = duration_per_1000_runs / (vertices as u32 / 1_000);
println!(
"Average time per 1000 runs per 1000 verts: {}",
format_duration(duration_per_1000_runs_per_1000_verts)
);
// Now try the async version
let runs = 1_000_000;
let rw_lock_mesh = Arc::new(RwLock::new(mesh));
let duration = run_and_measure(|| {
let mut futures = vec![];
for _ in 0..runs {
let start_x = rand::random::<f32>() * width + mesh_min_x;
let start_y = rand::random::<f32>() * height + mesh_min_y;
let end_x = rand::random::<f32>() * width + mesh_min_x;
let end_y = rand::random::<f32>() * height + mesh_min_y;
let rw_lock_mesh = rw_lock_mesh.clone();
futures.push(async move {
let mesh = rw_lock_mesh.read().await;
std::intrinsics::black_box(
mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y))
.await,
);
});
}
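// Bug (found later in this thread): the join_all future below is returned
// from the synchronous closure but never awaited, so the paths are never
// actually computed; only future *creation* is being timed.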
futures::future::join_all(futures)
});
let duration_per_1000_runs = duration.1 / (runs / 1_000);
println!(
"Average time per 1000 runs per {} verts (async): {}",
vertices,
format_duration(duration_per_1000_runs)
);
let duration_per_1000_runs_per_1000_verts = duration_per_1000_runs / (vertices as u32 / 1_000);
println!(
"Average time per 1000 runs per 1000 verts (async): {}",
format_duration(duration_per_1000_runs_per_1000_verts)
);
}
```
Thanks in advance for any answers! Very interested in the details :P
I tried it in a benchmark using criterion:

```
path sync time: [326.31 µs 335.50 µs 344.31 µs]
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
path async futures time: [327.87 µs 337.28 µs 346.35 µs]
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
path async smol time: [331.01 µs 362.15 µs 393.84 µs]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
path async tokio time: [319.19 µs 328.30 µs 337.53 µs]
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
```
Source for the bench:

```rust
use criterion::async_executor::{FuturesExecutor, SmolExecutor};
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use glam::Vec2;
use polyanya::{Mesh, PolyanyaFile};
use rand::{rngs::StdRng, Rng, SeedableRng};
fn path_sync_vs_async(c: &mut Criterion) {
let mesh: Mesh = PolyanyaFile::from_file("meshes/aurora-merged.mesh").into();
let mesh_min_x = mesh
.vertices
.iter()
.map(|v| v.coords.x)
.fold(f32::INFINITY, |a, b| a.min(b));
let mesh_max_x = mesh
.vertices
.iter()
.map(|v| v.coords.x)
.fold(f32::NEG_INFINITY, |a, b| a.max(b));
let mesh_min_y = mesh
.vertices
.iter()
.map(|v| v.coords.y)
.fold(f32::INFINITY, |a, b| a.min(b));
let mesh_max_y = mesh
.vertices
.iter()
.map(|v| v.coords.y)
.fold(f32::NEG_INFINITY, |a, b| a.max(b));
let width = mesh_max_x - mesh_min_x;
let height = mesh_max_y - mesh_min_y;
let mut random = StdRng::from_seed([0; 32]);
c.bench_function(&"path sync".to_string(), |b| {
b.iter(|| {
let start_x = random.gen::<f32>() * width + mesh_min_x;
let start_y = random.gen::<f32>() * height + mesh_min_y;
let end_x = random.gen::<f32>() * width + mesh_min_x;
let end_y = random.gen::<f32>() * height + mesh_min_y;
black_box(mesh.path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
})
});
// recreate random from the same seed
let mut random = StdRng::from_seed([0; 32]);
c.bench_function(&"path async futures".to_string(), |b| {
b.to_async(FuturesExecutor).iter(|| {
let start_x = random.gen::<f32>() * width + mesh_min_x;
let start_y = random.gen::<f32>() * height + mesh_min_y;
let end_x = random.gen::<f32>() * width + mesh_min_x;
let end_y = random.gen::<f32>() * height + mesh_min_y;
black_box(mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
})
});
// recreate random from the same seed
let mut random = StdRng::from_seed([0; 32]);
c.bench_function(&"path async smol".to_string(), |b| {
b.to_async(SmolExecutor).iter(|| {
let start_x = random.gen::<f32>() * width + mesh_min_x;
let start_y = random.gen::<f32>() * height + mesh_min_y;
let end_x = random.gen::<f32>() * width + mesh_min_x;
let end_y = random.gen::<f32>() * height + mesh_min_y;
black_box(mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
})
});
// recreate random from the same seed
let mut random = StdRng::from_seed([0; 32]);
c.bench_function(&"path async tokio".to_string(), |b| {
let rt = tokio::runtime::Runtime::new().unwrap();
b.to_async(rt).iter(|| {
let start_x = random.gen::<f32>() * width + mesh_min_x;
let start_y = random.gen::<f32>() * height + mesh_min_y;
let end_x = random.gen::<f32>() * width + mesh_min_x;
let end_y = random.gen::<f32>() * height + mesh_min_y;
black_box(mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
})
});
}
criterion_group!(benches, path_sync_vs_async);
criterion_main!(benches);
```
Sync and async with the futures crate seem to have the same kind of performance; it's a tiny bit slower with smol, and a tiny bit faster with tokio.
I don't have a good explanation; maybe tokio is better at small tasks?
My bad! I wasn't awaiting the join_all properly in my test, so the async version only measured creating the futures, not actually running them! I would consider the issue resolved 👍🏻
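For reference, here is a minimal self-contained sketch of the pitfall (using placeholder futures instead of polyanya calls) showing the difference between building the join_all future and actually awaiting it:

```rust
use std::time::Instant;

#[tokio::main]
async fn main() {
    // Stand-ins for the pathfinding futures from the original test.
    let futures: Vec<_> = (0..1_000_000u64)
        .map(|i| async move { i.wrapping_mul(2) })
        .collect();

    // What the broken test measured: join_all only *builds* a combined
    // future here; nothing is polled yet, so this is almost free.
    let start = Instant::now();
    let joined = futures::future::join_all(futures);
    println!("building join_all: {:?}", start.elapsed());

    // The fix: awaiting actually drives every future to completion.
    let start = Instant::now();
    let results = joined.await;
    println!("awaiting join_all: {:?} ({} results)", start.elapsed(), results.len());
}
```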
One last question I have though: what's the idea behind the async endpoint? Is it better than just quickly creating a task and calculating a path inside? 🤔 I think it's the first time I see an async endpoint in any kind of pathfinding library, so I am curious!
> One last question I have though: what's the idea behind the async endpoint? Is it better than just quickly creating a task and calculating a path inside?
The async endpoint returns Poll::Pending periodically, which allows the task to be interrupted at the async-scheduler level as well as at the OS-thread level, enabling cooperative multitasking.
Wrapping a normal blocking function in a task doesn't allow interrupting it while it's running.
As I'm using Polyanya for video games, blocking should be avoided. In Bevy I still use the blocking version in an async task, but that means I could be limited in the number of pathfinding tasks that can run in parallel.
That said, it's mostly a theoretical point, as I haven't had paths that took long enough to compute to actually block other tasks. But the pathfinding algorithm has a clear "step" definition that allows handing back control between steps, and I like async Rust, so I wanted to try exposing an API with it.
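To illustrate, here is a minimal hand-rolled sketch of that pattern (an illustration only, not polyanya's actual implementation): a future that does one bounded "step" of work per poll and returns Poll::Pending in between, so the executor can interleave other tasks.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Hypothetical step-based search exposed as a future.
struct StepwiseSearch {
    steps_remaining: u32,
}

impl Future for StepwiseSearch {
    type Output = u32; // stand-in for a computed path

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.steps_remaining == 0 {
            return Poll::Ready(42);
        }
        // ...do one bounded unit of work here (e.g. expand one search node)...
        self.steps_remaining -= 1;
        // Ask to be polled again, then yield so other tasks can run.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}
```

Every Poll::Pending is a point where the scheduler can run something else; a blocking call has no such points.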
When running many pathfinding tasks in parallel:
- with the blocking API, they'll wait longer to start, but finish faster once started
- with the async API, they'll start sooner, but finish slower
The async API could also be friendlier to other tasks if you have a lot of other things running in parallel.
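For what it's worth, here is a hedged sketch of what the two styles can look like under tokio, using the two entry points from the code above (the blocking mesh.path and the async mesh.get_path) and assuming Mesh is Send + Sync; the coordinates are arbitrary:

```rust
use std::sync::Arc;
use glam::Vec2;
use polyanya::{Mesh, PolyanyaFile};

#[tokio::main]
async fn main() {
    let mesh: Mesh = PolyanyaFile::from_file("meshes/aurora-merged.mesh").into();
    let mesh = Arc::new(mesh);
    // Arbitrary illustrative points.
    let (from, to) = (Vec2::new(1.0, 1.0), Vec2::new(100.0, 100.0));

    // Blocking API: each search occupies a blocking-pool thread until it
    // finishes, so the pool size caps how many searches progress at once.
    let m = mesh.clone();
    let blocking = tokio::task::spawn_blocking(move || m.path(from, to));

    // Async API: the search yields periodically, so many searches can be
    // interleaved on the same worker threads and all start almost immediately.
    let m = mesh.clone();
    let cooperative = tokio::spawn(async move { m.get_path(from, to).await });

    let _ = (blocking.await, cooperative.await);
}
```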