polyanya icon indicating copy to clipboard operation
polyanya copied to clipboard

Blocking vs async performance

Open skyne98 opened this issue 2 years ago • 2 comments

Hi! Recently did a small test to measure the performance and got very surprising results. (on aurora merged)

Average time per 1000 runs per 34707 verts: 336ms 231us 70ns
Average time per 1000 runs per 1000 verts: 9ms 889us 149ns
Average time per 1000 runs per 34707 verts (async): 217us 79ns
Average time per 1000 runs per 1000 verts (async): 6us 384ns

Can someone explain if maybe I did something wrong (see code below) or how the async version is so much faster? It doesn't look multithreaded, so how?

#![feature(core_intrinsics)]

use std::sync::Arc;

use glam::Vec2;
use humantime::format_duration;
use polyanya::{Mesh, PolyanyaFile};
use tokio::sync::RwLock;

fn run_and_measure<F: FnOnce() -> R, R>(f: F) -> (R, std::time::Duration) {
    let start = std::time::Instant::now();
    let result = f();
    let elapsed = start.elapsed();
    (result, elapsed)
}

#[tokio::main]
async fn main() {
    let mesh: Mesh = PolyanyaFile::from_file("meshes/aurora-merged.mesh").into();
    let mesh_min_x = mesh
        .vertices
        .iter()
        .map(|v| v.coords.x)
        .fold(f32::INFINITY, |a, b| a.min(b));
    let mesh_max_x = mesh
        .vertices
        .iter()
        .map(|v| v.coords.x)
        .fold(f32::NEG_INFINITY, |a, b| a.max(b));
    let mesh_min_y = mesh
        .vertices
        .iter()
        .map(|v| v.coords.y)
        .fold(f32::INFINITY, |a, b| a.min(b));
    let mesh_max_y = mesh
        .vertices
        .iter()
        .map(|v| v.coords.y)
        .fold(f32::NEG_INFINITY, |a, b| a.max(b));
    let width = mesh_max_x - mesh_min_x;
    let height = mesh_max_y - mesh_min_y;

    let runs = 10_000;
    let duration = run_and_measure(|| {
        for _ in 0..runs {
            let start_x = rand::random::<f32>() * width + mesh_min_x;
            let start_y = rand::random::<f32>() * height + mesh_min_y;
            let end_x = rand::random::<f32>() * width + mesh_min_x;
            let end_y = rand::random::<f32>() * height + mesh_min_y;

            std::intrinsics::black_box(
                mesh.path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)),
            );
        }
    });

    let duration_per_1000_runs = duration.1 / (runs / 1_000);
    let vertices = mesh.vertices.len();
    println!(
        "Average time per 1000 runs per {} verts: {}",
        vertices,
        format_duration(duration_per_1000_runs).to_string()
    );

    let duration_per_1000_runs_per_1000_verts = duration_per_1000_runs / (vertices as u32 / 1_000);
    println!(
        "Average time per 1000 runs per 1000 verts: {}",
        format_duration(duration_per_1000_runs_per_1000_verts).to_string()
    );

    // Now try the async version
    let runs = 1_000_000;
    let rw_lock_mesh = Arc::new(RwLock::new(mesh));
    let duration = run_and_measure(|| {
        let mut futures = vec![];
        for _ in 0..runs {
            let start_x = rand::random::<f32>() * width + mesh_min_x;
            let start_y = rand::random::<f32>() * height + mesh_min_y;
            let end_x = rand::random::<f32>() * width + mesh_min_x;
            let end_y = rand::random::<f32>() * height + mesh_min_y;

            let rw_lock_mesh = rw_lock_mesh.clone();
            futures.push(async move {
                let mesh = rw_lock_mesh.read().await;
                std::intrinsics::black_box(
                    mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y))
                        .await,
                );
            });
        }
        futures::future::join_all(futures)
    });

    let duration_per_1000_runs = duration.1 / (runs / 1_000);
    println!(
        "Average time per 1000 runs per {} verts (async): {}",
        vertices,
        format_duration(duration_per_1000_runs).to_string()
    );

    let duration_per_1000_runs_per_1000_verts = duration_per_1000_runs / (vertices as u32 / 1_000);
    println!(
        "Average time per 1000 runs per 1000 verts (async): {}",
        format_duration(duration_per_1000_runs_per_1000_verts).to_string()
    );
}

Thanks for the answer in advance! Very interested in the details :P

skyne98 avatar Feb 05 '24 14:02 skyne98

I tried in a benchmark using criterion

path sync           time:   [326.31 µs 335.50 µs 344.31 µs]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

path async futures  time:   [327.87 µs 337.28 µs 346.35 µs]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

path async smol     time:   [331.01 µs 362.15 µs 393.84 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

path async tokio    time:   [319.19 µs 328.30 µs 337.53 µs]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
source for the bench
use criterion::async_executor::{FuturesExecutor, SmolExecutor};
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use glam::Vec2;
use polyanya::{Mesh, PolyanyaFile};
use rand::{rngs::StdRng, Rng, SeedableRng};

fn path_sync_vs_async(c: &mut Criterion) {
    let mesh: Mesh = PolyanyaFile::from_file("meshes/aurora-merged.mesh").into();
    let mesh_min_x = mesh
        .vertices
        .iter()
        .map(|v| v.coords.x)
        .fold(f32::INFINITY, |a, b| a.min(b));
    let mesh_max_x = mesh
        .vertices
        .iter()
        .map(|v| v.coords.x)
        .fold(f32::NEG_INFINITY, |a, b| a.max(b));
    let mesh_min_y = mesh
        .vertices
        .iter()
        .map(|v| v.coords.y)
        .fold(f32::INFINITY, |a, b| a.min(b));
    let mesh_max_y = mesh
        .vertices
        .iter()
        .map(|v| v.coords.y)
        .fold(f32::NEG_INFINITY, |a, b| a.max(b));
    let width = mesh_max_x - mesh_min_x;
    let height = mesh_max_y - mesh_min_y;

    let mut random = StdRng::from_seed([0; 32]);
    c.bench_function(&"path sync".to_string(), |b| {
        b.iter(|| {
            let start_x = random.gen::<f32>() * width + mesh_min_x;
            let start_y = random.gen::<f32>() * height + mesh_min_y;
            let end_x = random.gen::<f32>() * width + mesh_min_x;
            let end_y = random.gen::<f32>() * height + mesh_min_y;

            black_box(mesh.path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
        })
    });

    // recreate random from the same seed
    let mut random = StdRng::from_seed([0; 32]);
    c.bench_function(&"path async futures".to_string(), |b| {
        b.to_async(FuturesExecutor).iter(|| {
            let start_x = random.gen::<f32>() * width + mesh_min_x;
            let start_y = random.gen::<f32>() * height + mesh_min_y;
            let end_x = random.gen::<f32>() * width + mesh_min_x;
            let end_y = random.gen::<f32>() * height + mesh_min_y;

            black_box(mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
        })
    });

    // recreate random from the same seed
    let mut random = StdRng::from_seed([0; 32]);
    c.bench_function(&"path async smol".to_string(), |b| {
        b.to_async(SmolExecutor).iter(|| {
            let start_x = random.gen::<f32>() * width + mesh_min_x;
            let start_y = random.gen::<f32>() * height + mesh_min_y;
            let end_x = random.gen::<f32>() * width + mesh_min_x;
            let end_y = random.gen::<f32>() * height + mesh_min_y;

            black_box(mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
        })
    });

    // recreate random from the same seed
    let mut random = StdRng::from_seed([0; 32]);
    c.bench_function(&"path async tokio".to_string(), |b| {
        let rt = tokio::runtime::Runtime::new().unwrap();
        b.to_async(rt).iter(|| {
            let start_x = random.gen::<f32>() * width + mesh_min_x;
            let start_y = random.gen::<f32>() * height + mesh_min_y;
            let end_x = random.gen::<f32>() * width + mesh_min_x;
            let end_y = random.gen::<f32>() * height + mesh_min_y;

            black_box(mesh.get_path(Vec2::new(start_x, start_y), Vec2::new(end_x, end_y)))
        })
    });
}

criterion_group!(benches, path_sync_vs_async);
criterion_main!(benches);

sync and async with the futures crate seem to have the same kind of performance, a tiny bit slower with smog, and a tiny bit faster with tokio.

I don't have a good explanation, maybe tokio is better at small tasks?

mockersf avatar Feb 05 '24 17:02 mockersf

My bad! I wasn't awaiting the join_all properly in my test! I would consider the issue resolved 👍🏻 One last question I have though: what's the idea behind the async endpoint? Is it better than just quickly creating a task and calculating a path inside? 🤔 I think it's the first time I see an async endpoint in any kind of pathfinding library, so I am curious!

skyne98 avatar Feb 06 '24 15:02 skyne98

One last question I have though: what's the idea behind the async endpoint? Is it better than just quickly creating a task and calculating a path inside?

The async endpoint returns Poll::Pending periodically, allowing interrupting the task at an async-scheduler level as well as at the OS-thread level, allowing for cooperative multitasking.

Wrapping a normal function in a task won't allow for interrupting the task as the task is ongoing.

gmorenz avatar Jun 24 '24 14:06 gmorenz

As I'm using Polyanya for video game, blocking on something should be avoided. In Bevy I still use the blocking version in an async task, but that means that I could be limited in the number of pathfinding tasks that can run in parallel.

That said, it's mostly a theoretical point, as I haven't had paths that took long enough to compute to actually block other tasks, but as the pathfinding algorithm has a clear "step" definition that allow to return control to the OS and I like async Rust, I wanted to try to expose an API with it.

when running many pathfinding tasks in parallel:

  • with the blocking API, they'll wait longer to start, but finish faster once started
  • with the async API, they'll start sooner, but finish slower

The async API could also be friendlier to other tasks if you have a lot of other things running in parallel

mockersf avatar Jun 29 '24 08:06 mockersf