Effect of dynamicScene on World with many instances
I was testing with a scene that has roughly 3 million instances of a triangle Mesh placed in a large box-shaped area. The instanced geometry is fairly simple (1280 triangles in a disc-shaped object) and should be fairly easy for a BVH builder to handle, as the instances can be readily separated into non-overlapping groups.
Looking at the scene build statistics, particularly world.commit() and the Embree BVH build stats (also see the patch further below), the build takes around 900 ms. Having written a bit of BVH code in the past, this feels like it could be done faster :-) Particularly since Embree uses multi-threaded construction and has undoubtedly been heavily optimized over the years. So this made me look into the effect of the dynamicScene flag on the World, and I find it barely makes a difference for this use case. Here are some statistics from several runs:
dynamicScene=true
finished BVH8<instance> : 902.191ms, 3.32524 Mprim/s, 0.292055 GB/s
finished BVH8<instance> : 917.183ms, 3.27089 Mprim/s, 0.287282 GB/s
finished BVH8<instance> : 902.855ms, 3.32279 Mprim/s, 0.291841 GB/s
finished BVH8<instance> : 923.098ms, 3.24993 Mprim/s, 0.285441 GB/s
finished BVH8<instance> : 921.61ms, 3.25517 Mprim/s, 0.285902 GB/s
primitives = 3000000, vertices = 0, depth = 9
total : sah = 377.286 (100.00%), #bytes = 262.94 MB (100.00%), #nodes = 3839614 ( 70.39% filled), #bytes/prim = 87.65
getAABBNodes : sah = 245.430 ( 65.05%), #bytes = 214.94 MB ( 81.74%), #nodes = 839614 ( 57.16% filled), #bytes/prim = 71.65
leaves : sah = 131.856 ( 34.95%), #bytes = 48.00 MB ( 18.26%), #nodes = 3000000 (100.00% filled), #bytes/prim = 16.00
histogram : 100.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
alloc : used = 263.490 MB, #bytes/prim = 87.83
alloc : used = 263.490 MB, free = 0.020 MB, wasted = 3.386 MB, total = 266.896 MB, #bytes/prim = 88.97
total : used = 266.896 MB, free = 3.628 MB, wasted = 0.017 MB, total = 270.541 MB, #bytes/prim = 90.18
4K : used = 0.000 MB, free = 0.000 MB, wasted = 0.000 MB, total = 0.000 MB, #bytes/prim = 0.00
2M : used = 0.000 MB, free = 0.000 MB, wasted = 0.000 MB, total = 0.000 MB, #bytes/prim = 0.00
malloc: used = 266.896 MB, free = 3.628 MB, wasted = 0.017 MB, total = 270.541 MB, #bytes/prim = 90.18
shared: used = 0.000 MB, free = 0.000 MB, wasted = 0.000 MB, total = 0.000 MB, #bytes/prim = 0.00
dynamicScene=false
finished BVH8<instance> : 912.671ms, 3.28705 Mprim/s, 0.288702 GB/s
finished BVH8<instance> : 913.075ms, 3.2856 Mprim/s, 0.288574 GB/s
finished BVH8<instance> : 910.718ms, 3.2941 Mprim/s, 0.289321 GB/s
finished BVH8<instance> : 991.954ms, 3.02433 Mprim/s, 0.265627 GB/s
finished BVH8<instance> : 906.876ms, 3.30806 Mprim/s, 0.290547 GB/s
primitives = 3000000, vertices = 0, depth = 9
total : sah = 377.286 (100.00%), #bytes = 262.94 MB (100.00%), #nodes = 3839614 ( 70.39% filled), #bytes/prim = 87.65
getAABBNodes : sah = 245.430 ( 65.05%), #bytes = 214.94 MB ( 81.74%), #nodes = 839614 ( 57.16% filled), #bytes/prim = 71.65
leaves : sah = 131.856 ( 34.95%), #bytes = 48.00 MB ( 18.26%), #nodes = 3000000 (100.00% filled), #bytes/prim = 16.00
histogram : 100.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
alloc : used = 263.490 MB, #bytes/prim = 87.83
alloc : used = 263.490 MB, free = 0.015 MB, wasted = 3.387 MB, total = 266.891 MB, #bytes/prim = 88.96
total : used = 266.891 MB, free = 1.536 MB, wasted = 0.008 MB, total = 268.435 MB, #bytes/prim = 89.48
4K : used = 266.891 MB, free = 1.536 MB, wasted = 0.008 MB, total = 268.435 MB, #bytes/prim = 89.48
2M : used = 0.000 MB, free = 0.000 MB, wasted = 0.000 MB, total = 0.000 MB, #bytes/prim = 0.00
malloc: used = 0.000 MB, free = 0.000 MB, wasted = 0.000 MB, total = 0.000 MB, #bytes/prim = 0.00
shared: used = 0.000 MB, free = 0.000 MB, wasted = 0.000 MB, total = 0.000 MB, #bytes/prim = 0.00
It's hard to detect any pattern here, which is really surprising (the largest difference I see is the malloc versus 4K usage, but I'm not even sure this is caused by the dynamicScene flag). I checked my code to see that I'm setting the flag correctly, committing the world, etc., but I don't see anything obviously wrong.
What I'm wondering about is the relevant OSPRay code. https://github.com/ospray/ospray/blob/fdda0889f9143a8b20f26389c22d1691f1a6a527/modules/cpu/common/World.cpp#L89-L90 sets only RTC_SCENE_FLAG_DYNAMIC. The OSPRay docs mention RTC_SCENE_DYNAMIC in the description of dynamicScene, but that appears to be the name of the Embree v2 parameter? E.g. looking at https://github.com/embree/embree/blob/489b746c0d5010e0da10345e9dc96768bec9a037/scripts/embree2_to_embree3.patch#L422-L424 it seems the equivalent in Embree v3 would be to also enable RTC_BUILD_QUALITY_LOW.
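For reference, this is roughly what the combination looks like in plain Embree 3. It is only a minimal sketch, not the OSPRay code: it assumes the Embree 3 C API (`rtcSetSceneFlags`, `rtcSetSceneBuildQuality`) with the Embree 3 headers and library installed, and the helper name `makeDynamicLowQualityScene` is made up for illustration.

```cpp
#include <embree3/rtcore.h>

// Hypothetical helper (not part of OSPRay or Embree): create an Embree 3
// scene that is flagged as dynamic and also uses the low build quality.
RTCScene makeDynamicLowQualityScene(RTCDevice device)
{
  RTCScene scene = rtcNewScene(device);

  // The flag that the linked World.cpp lines set for dynamicScene.
  rtcSetSceneFlags(scene, RTC_SCENE_FLAG_DYNAMIC);

  // The extra setting discussed here. In Embree 3 the build quality is set
  // through its own call rather than as a scene flag bit.
  rtcSetSceneBuildQuality(scene, RTC_BUILD_QUALITY_LOW);

  return scene;
}

int main()
{
  RTCDevice device = rtcNewDevice(nullptr); // default device configuration
  RTCScene scene = makeDynamicLowQualityScene(device);

  // ... attach geometries / instances here before committing ...
  rtcCommitScene(scene);

  rtcReleaseScene(scene);
  rtcReleaseDevice(device);
  return 0;
}
```

Both settings have to be applied before `rtcCommitScene` for them to affect the build.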
Hacking the extra RTC_BUILD_QUALITY_LOW into the World commit code in the OSPRay sources indeed makes a noticeable difference (roughly 820-860 ms instead of 900+ ms):
# dynamicScene = RTC_SCENE_FLAG_DYNAMIC|RTC_BUILD_QUALITY_LOW
finished BVH8<instance> : 821.72ms, 3.65088 Mprim/s, 0.320656 GB/s
finished BVH8<instance> : 829.221ms, 3.61785 Mprim/s, 0.317756 GB/s
finished BVH8<instance> : 827.955ms, 3.62339 Mprim/s, 0.318242 GB/s
finished BVH8<instance> : 820.362ms, 3.65692 Mprim/s, 0.321187 GB/s
finished BVH8<instance> : 862.971ms, 3.47636 Mprim/s, 0.305329 GB/s
primitives = 3000000, vertices = 0, depth = 9
total : sah = 377.286 (100.00%), #bytes = 262.94 MB (100.00%), #nodes = 3839614 ( 70.39% filled), #bytes/prim = 87.65
getAABBNodes : sah = 245.430 ( 65.05%), #bytes = 214.94 MB ( 81.74%), #nodes = 839614 ( 57.16% filled), #bytes/prim = 71.65
leaves : sah = 131.856 ( 34.95%), #bytes = 48.00 MB ( 18.26%), #nodes = 3000000 (100.00% filled), #bytes/prim = 16.00
histogram : 100.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
alloc : used = 263.490 MB, #bytes/prim = 87.83
alloc : used = 263.490 MB, free = 0.011 MB, wasted = 3.387 MB, total = 266.888 MB, #bytes/prim = 88.96
total : used = 266.888 MB, free = 3.637 MB, wasted = 0.017 MB, total = 270.541 MB, #bytes/prim = 90.18
4K : used = 0.000 MB, free = 0.000 MB, wasted = 0.000 MB, total = 0.000 MB, #bytes/prim = 0.00
2M : used = 0.000 MB, free = 0.000 MB, wasted = 0.000 MB, total = 0.000 MB, #bytes/prim = 0.00
malloc: used = 266.888 MB, free = 3.637 MB, wasted = 0.017 MB, total = 270.541 MB, #bytes/prim = 90.18
shared: used = 0.000 MB, free = 0.000 MB, wasted = 0.000 MB, total = 0.000 MB, #bytes/prim = 0.00
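To put a number on "noticeable", here is a small sketch that averages the build times from the three sets of runs above. The helper names `meanMs` and `relativeGain` are made up for this example; the timings are copied from the "finished BVH8<instance>" lines.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a list of build times in milliseconds.
double meanMs(const std::vector<double> &timesMs)
{
  assert(!timesMs.empty());
  return std::accumulate(timesMs.begin(), timesMs.end(), 0.0)
      / static_cast<double>(timesMs.size());
}

// Fraction by which candidateMs is faster than baselineMs (0.1 == 10% faster).
double relativeGain(double baselineMs, double candidateMs)
{
  return (baselineMs - candidateMs) / baselineMs;
}

// Build times in ms, copied from the three sets of runs above.
const std::vector<double> dynamicTrueMs = {
    902.191, 917.183, 902.855, 923.098, 921.61};
const std::vector<double> dynamicFalseMs = {
    912.671, 913.075, 910.718, 991.954, 906.876};
const std::vector<double> dynamicLowQualityMs = {
    821.72, 829.221, 827.955, 820.362, 862.971};
```

The means come out at roughly 913 ms (dynamicScene=true), 927 ms (dynamicScene=false) and 832 ms (with RTC_BUILD_QUALITY_LOW added), so the low build quality is about 9% faster than dynamicScene=true alone. The gap between true and false is mostly due to the single 992 ms outlier in the dynamicScene=false runs.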
So this makes me wonder if the current RTC_SCENE_FLAG_DYNAMIC flag alone provides enough benefit, and whether extra control over the BVH build quality would also be of interest. I'd also be curious whether the BVH build performance I'm seeing for this number of instances is typical.
Btw, as far as I could tell you can only get Embree BVH statistics when using --osp:debug, but that forces single-threading, even for BVH construction it seems. So I applied this patch to add direct control over Embree verbosity independent of the debug flag:
diff --git a/ospray/api/Device.cpp b/ospray/api/Device.cpp
index 10158b17c..410037c90 100644
--- a/ospray/api/Device.cpp
+++ b/ospray/api/Device.cpp
@@ -88,6 +88,9 @@ void Device::commit()
logLevel = logLevelFromString(logLevelStr).value_or(logLevel);
+ auto OSPRAY_EMBREE_VERBOSITY = utility::getEnvVar<int>("OSPRAY_EMBREE_VERBOSITY");
+ embreeVerbosity = OSPRAY_EMBREE_VERBOSITY.value_or(getParam<int>("embreeVerbosity", 0));
+
auto OSPRAY_NUM_THREADS = utility::getEnvVar<int>("OSPRAY_NUM_THREADS");
numThreads = OSPRAY_NUM_THREADS.value_or(getParam<int>("numThreads", -1));
@@ -151,7 +154,9 @@ std::string generateEmbreeDeviceCfg(const Device &device)
{
std::stringstream embreeConfig;
- if (device.debugMode)
+ if (device.embreeVerbosity > 0)
+ embreeConfig << (" verbose=" + std::to_string(device.embreeVerbosity));
+ else if (device.debugMode)
embreeConfig << " verbose=2";
if (device.threadAffinity == api::Device::AFFINITIZE)
diff --git a/ospray/api/Device.h b/ospray/api/Device.h
index f14c30562..3d35d562b 100644
--- a/ospray/api/Device.h
+++ b/ospray/api/Device.h
@@ -150,6 +150,7 @@ struct OSPRAY_CORE_INTERFACE Device : public memory::RefCountedObject,
int numThreads{-1};
bool debugMode{false};
bool apiTraceEnabled{false};
+ int embreeVerbosity{0};
enum OSP_THREAD_AFFINITY
{