GPU API Feedback
Hi!
I saw the GPU API push and have a bit of feedback to provide:
Unsupported vertex formats
SDL_GPUVERTFMT_CHAR
SDL_GPUVERTFMT_UCHAR
SDL_GPUVERTFMT_UCHAR3
SDL_GPUVERTFMT_CHAR3
SDL_GPUVERTFMT_USHORT3
SDL_GPUVERTFMT_SHORT3
SDL_GPUVERTFMT_HALF3
Get rid of them (including the NORMALIZED variants). GPU support for them is flaky (or non-existent) and requires expensive emulation, which is hard to get right due to alignment issues.
Merging format enums?
typedef enum SDL_GpuVertexFormat
typedef enum SDL_GpuPixelFormat
The trend is to unify these two, since the supported GpuVertexFormat values are just a subset of GpuPixelFormat.
But personally I'm undecided between the two options. They have pros and cons. From a low-level API perspective, unification makes sense. At a higher level? I don't know.
Format enum
Vulkan, DXGI and MTLPixelFormat formats are awfully similar.
In my experience this has been the best way to encode formats and intention (compared to alternatives like e.g. storing bits and shifts like DDS did back in the D3D7/DirectDraw days)
Personally I prefer the short versions, e.g. RGBA8_UNORM is just better than RGBA8888_UNORM when all channels have the same bit count. Likewise RGB10A2 over RGBA1010102.
DXGI uses TYPELESS to indicate the format can be reinterpreted. Vulkan & Metal have no such format. In OgreNext we approach that by having a getFamily function:
SDL_GpuPixelFormat getFamily( SDL_GpuPixelFormat f );
This function returns the same family format for all the variations, in order to simplify many operations (e.g. RGBA8_UNORM and RGBA8 are exactly the same in all ways except in how the data is returned when viewed by a shader).
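A minimal sketch of what such a getFamily mapping could look like, using a hypothetical trimmed-down format enum (the names here are illustrative, not the actual SDL enum):

```c
#include <assert.h>

/* Hypothetical format enum -- a trimmed sketch, not the real SDL_GpuPixelFormat. */
typedef enum SDL_GpuPixelFormat
{
    SDL_GPUPIXELFMT_RGBA8,        /* family representative ("typeless") */
    SDL_GPUPIXELFMT_RGBA8_UNORM,
    SDL_GPUPIXELFMT_RGBA8_SNORM,
    SDL_GPUPIXELFMT_RGBA8_SRGB,
    SDL_GPUPIXELFMT_RGB10A2,
    SDL_GPUPIXELFMT_RGB10A2_UNORM
} SDL_GpuPixelFormat;

/* Collapse every variation onto one representative, so callers can answer
   "are these two formats bit-compatible?" with a single equality test. */
SDL_GpuPixelFormat getFamily( SDL_GpuPixelFormat f )
{
    switch( f )
    {
    case SDL_GPUPIXELFMT_RGBA8_UNORM:
    case SDL_GPUPIXELFMT_RGBA8_SNORM:
    case SDL_GPUPIXELFMT_RGBA8_SRGB:
        return SDL_GPUPIXELFMT_RGBA8;
    case SDL_GPUPIXELFMT_RGB10A2_UNORM:
        return SDL_GPUPIXELFMT_RGB10A2;
    default:
        return f; /* already the family representative */
    }
}
```

With this shape, "can texture A be reinterpreted as format B?" becomes `getFamily(a) == getFamily(b)`.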
More flags
typedef enum SDL_GpuTextureUsage
{
SDL_GPUTEXUSAGE_SHADERREAD,
SDL_GPUTEXUSAGE_SHADERWRITE,
SDL_GPUTEXUSAGE_RENDERTARGET
} SDL_GpuTextureUsage;
You'll want to include a few more flags such as:
- SDL_GPUTEXUSAGE_MIPMAPGEN. When the user wants SDL to generate mipmaps: on D3D11 this means the RENDERTARGET flag must be added so D3D11's GenerateMips can be used, but on D3D12 it means SHADERWRITE must be added so a compute shader can generate the mipmaps.
- SDL_GPUTEXUSAGE_NOT_SAMPLED. Almost all textures want to be sampled, but there are a few where the user intends only read and write operations, without sampling. Hence a negated flag to indicate sampling is not necessary. Alternatively, add SDL_GPUTEXUSAGE_SAMPLED and have the user almost always set it. Although flags should generally express positive terms instead of negation, in terms of user friendliness I think this is an exception, i.e. SDL_GPUTEXUSAGE_NOT_SAMPLED makes sense over SDL_GPUTEXUSAGE_SAMPLED.
- SDL_GPUTEXUSAGE_CUBE_ARRAY_2D. Indicates the cubemap or cubemap array also wants to be usable as a 2D array.
- SDL_GPUTEXUSAGE_REINTERPRETABLE. Indicates the texture will later be reinterpreted as another format.
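As a sketch of how a backend might consume such flags: the values and the expandUsageForBackend helper below are hypothetical (not real SDL API), but they show MIPMAPGEN being expanded into the flag each backend actually needs, as described above.

```c
#include <assert.h>

/* Hypothetical usage enum as bit flags (the values are illustrative). */
typedef enum SDL_GpuTextureUsage
{
    SDL_GPUTEXUSAGE_SHADERREAD      = 1 << 0,
    SDL_GPUTEXUSAGE_SHADERWRITE     = 1 << 1,
    SDL_GPUTEXUSAGE_RENDERTARGET    = 1 << 2,
    SDL_GPUTEXUSAGE_MIPMAPGEN       = 1 << 3,
    SDL_GPUTEXUSAGE_NOT_SAMPLED     = 1 << 4,
    SDL_GPUTEXUSAGE_CUBE_ARRAY_2D   = 1 << 5,
    SDL_GPUTEXUSAGE_REINTERPRETABLE = 1 << 6
} SDL_GpuTextureUsage;

/* On D3D11, mipmap generation relies on GenerateMips, which needs a render
   target; on D3D12, a compute shader does it, which needs shader write.
   A backend would expand MIPMAPGEN into whichever flag it requires. */
unsigned expandUsageForBackend( unsigned usage, int isD3D12 )
{
    if( usage & SDL_GPUTEXUSAGE_MIPMAPGEN )
        usage |= isD3D12 ? SDL_GPUTEXUSAGE_SHADERWRITE
                         : SDL_GPUTEXUSAGE_RENDERTARGET;
    return usage;
}
```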
More Texture types
SDL_GPUTEXTYPE_1D,
SDL_GPUTEXTYPE_2D,
SDL_GPUTEXTYPE_CUBE,
SDL_GPUTEXTYPE_3D
A few types are missing, like CUBE_ARRAY, 2D_ARRAY, etc.
Being very clear about 'depth'
typedef struct SDL_GpuTextureDescription
{
// ...
Uint32 width;
Uint32 height;
Uint32 depth;
Uint32 mipmap_levels;
// ...
}
Documentation needs to be very clear about how depth is interpreted:
- When SDL_GPUTEXTYPE_CUBE, must depth be 6 or 1?
  - Personal recommendation: depth must be 6 (see next item).
- When SDL_GPUTEXTYPE_CUBE_ARRAY, must depth be a multiple of 6? Or should depth be 1, 2, 3 and get internally multiplied by 6?
  - Personal recommendation: depth must be a multiple of 6 (see next item).
- When SDL_GPUTEXTYPE_CUBE_ARRAY but the texture is reinterpreted to SDL_GPUTEXTYPE_2D_ARRAY, and depth was 1 but internally multiplied by 6, is depth now 6x on the reinterpreted texture?
  - This is why I recommend depth be a multiple of 6 for SDL_GPUTEXTYPE_CUBE_*: when reinterpreted to SDL_GPUTEXTYPE_2D_ARRAY, all code is consistent and there's no surprise. There's no need for special handling.
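The recommended convention (depth carries the face count, so a multiple of 6 for cube types) could be validated like this; the enum values and the isValidDepth helper are illustrative only:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical texture type enum, extended with the array types. */
typedef enum SDL_GpuTextureType
{
    SDL_GPUTEXTYPE_1D,
    SDL_GPUTEXTYPE_2D,
    SDL_GPUTEXTYPE_2D_ARRAY,
    SDL_GPUTEXTYPE_CUBE,
    SDL_GPUTEXTYPE_CUBE_ARRAY,
    SDL_GPUTEXTYPE_3D
} SDL_GpuTextureType;

/* Under the recommended convention, cubemaps store their faces in 'depth':
   exactly 6 for CUBE, a non-zero multiple of 6 for CUBE_ARRAY. A later
   reinterpret to 2D_ARRAY then needs no depth fix-up at all. */
bool isValidDepth( SDL_GpuTextureType type, unsigned depth )
{
    switch( type )
    {
    case SDL_GPUTEXTYPE_CUBE:       return depth == 6u;
    case SDL_GPUTEXTYPE_CUBE_ARRAY: return depth != 0u && depth % 6u == 0u;
    default:                        return depth >= 1u;
    }
}
```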
I also recommend renaming depth to depthOrSlices, since its interpretation depends on whether the texture is of type SDL_GPUTEXTYPE_3D or not.
What's the difference between depth and slice?
- depth is halved when going to the next mip.
- slice is constant for all mips.
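The depth/slice distinction above can be sketched as a hypothetical helper that computes the third dimension of a given mip level:

```c
#include <assert.h>

/* For a 3D texture, depth halves each mip (clamped to 1); for array
   textures, the slice count is constant across all mips. */
unsigned mipDepth( unsigned depthOrSlices, unsigned mipLevel, int is3D )
{
    if( !is3D )
        return depthOrSlices;           /* slices: constant for all mips */
    unsigned d = depthOrSlices >> mipLevel; /* depth: halved per mip */
    return d ? d : 1u;
}
```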
OpenGL
OpenGL will by far give you the biggest headaches and constraints in your API:
- OpenGL does not support samplers being separate from textures at the shader level
- This is hard to emulate. You have to analyze the shader, find all pair combinations of <texture, sampler>, and then bind them all, potentially binding the same texture multiple times. Then generate a GLSL shader that remaps all of that. So if the original shader uses 5 textures and 3 samplers, at best you will have to bind 5 textures (with 5 samplers) and at worst 15 textures (with samplers).
- OpenGL threading support is bad. Really bad. The biggest problem in this area is when trying to map a GPU buffer from multiple threads. The best solution is to just disallow this.
- One possible solution is to run GL in its own thread (aka RHI thread as Unreal likes to call it)
- All API calls from any thread are submitted/redirected to the RHI thread
- Some API calls like Draw commands can return immediately, other API calls will stall until the RHI sees it and returns data (such as when mapping buffers)
- For performance this may mean some calls end up returning "OK" but later fail when the RHI sees them and tries to execute them. SDL could have a debug flag to wait for every RHI call, which should be off in Release.
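To illustrate the <texture, sampler> pair explosion described above, here's a small hypothetical helper that counts how many combined GL texture units a remapped shader would need (one per distinct pair):

```c
#include <assert.h>

/* One sample op in a shader, as a (texture slot, sampler slot) pair.
   GL needs one combined texture unit per *distinct* pair, so the GLSL
   remap must count them. Sketched with a tiny fixed-capacity set. */
typedef struct { int texture, sampler; } TexSamplerPair;

int countCombinedUnits( const TexSamplerPair *pairs, int numPairs )
{
    TexSamplerPair seen[64];
    int numSeen = 0;
    for( int i = 0; i < numPairs; ++i )
    {
        int found = 0;
        for( int j = 0; j < numSeen; ++j )
            if( seen[j].texture == pairs[i].texture &&
                seen[j].sampler == pairs[i].sampler )
                found = 1;
        if( !found && numSeen < 64 )
            seen[numSeen++] = pairs[i];
    }
    /* Result ranges from the number of textures (every texture paired with
       one sampler) up to textures * samplers (every combination used). */
    return numSeen;
}
```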
One-time mipmaps
There's mainly 2 cases for generating mipmaps on the fly:
- We're rendering to a texture, and need mipmaps for it.
- We'll be calling this often (e.g. once per frame or more)
- Textures are loaded from disk and were not bundled with mipmaps.
- We'll be generating mipmaps once. Then never again.
The first case is not an issue. The second case is. The problem is that the regular code:
- Create texture with SDL_GPUTEXUSAGE_MIPMAPGEN
- Load texture from disk
- Call SDL_GpuGenerateMipmaps
is inefficient. Why? Because the SDL_GPUTEXUSAGE_MIPMAPGEN flag will stay forever, which means the texture is internally either RENDERTARGET or SHADERWRITE when it doesn't have to be, disabling lots of internal GPU optimizations.
The easiest way to achieve this (cross platform):
- Generate tmp texture
- Load data into tmp texture
- Generate mipmaps into tmp
- Copy results into final texture
- Destroy tmp (or keep it cached for reuse)
An API that is easy to use could provide this convenience by handling the tmp texture internally when loading data from disk.
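The five steps above could be sketched as follows. Every function here is a stand-in stub (not real SDL API) that merely records the step, so only the control flow is shown:

```c
#include <assert.h>
#include <string.h>

/* Stubs standing in for a hypothetical GPU API: each records its step so
   the sequence can be inspected without a real GPU. */
static const char *g_log[8];
static int g_logCount = 0;
static void record( const char *step ) { g_log[g_logCount++] = step; }

static void createTmpTexture( void )  { record( "create_tmp" ); }
static void uploadToTmp( void )       { record( "upload" ); }
static void generateMipsInTmp( void ) { record( "genmips" ); }
static void copyTmpToFinal( void )    { record( "copy" ); }
static void destroyTmp( void )        { record( "destroy_tmp" ); }

/* One-time mipmap generation without tainting the final texture's usage
   flags: all RENDERTARGET/SHADERWRITE requirements live on the tmp. */
void loadWithOneTimeMipmaps( void )
{
    createTmpTexture();
    uploadToTmp();
    generateMipsInTmp();
    copyTmpToFinal();
    destroyTmp(); /* or keep it cached for the next load */
}
```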
SDL_GpuPipelineDescription's attachments
SDL_GpuPipelineDescription contains SDL_GpuColorAttachmentDescription, which contains a SDL_GpuTexture.
This is wrong. What's needed in the pipeline desc is the attachment's metadata (format, type, MSAA settings, etc.; everything except resolution). A structure should contain this data, and it could be obtained via the API:
SDL_GpuTextureMetadata *metadata = SDL_GetMetadata( SDL_GpuTexture *texture )
Actually SDL_GpuTextureMetadata = SDL_GpuTextureDescription but without the char *name.
or alternatively (perhaps this is even better):
SDL_GpuColorAttachmentDescriptionMeta *metaDescriptor = SDL_GetMeta( SDL_GpuColorAttachmentDescription *descriptor )
Hi, I would like to give my feedback as well. Why the separation of depth stencil state? This would work on Metal, but D3D12/Vulkan want all states at graphics pipeline creation; otherwise you'll need to cache and create pipelines on demand, which can result in GPU stutter. So SDL_GpuPipelineDescription probably needs all the members necessary to create a Vulkan/D3D12 pipeline.
Another topic I would like to bring attention to is memory allocation: what strategy would be used? Or existing D3D12MA/VMA?
why separation of depth stencil state? This would work on metal
Metal was my starting point for designing this, specifically because I think it's the most app-friendly (at least in the API concepts, if we're willing to avoid debates about Metal Shader Language and Objective-C). But as @darksylinc's detailed notes also point out, there are several small things in Metal that are going to cause headaches elsewhere, and this is probably also a good example of that.
Some of these might be worth changing, some might be worth faking on this platform or that, and some are worth putting in the trash. Definitely point these things out and we'll make changes to the API as we go. We are in extremely early times for this project.
Memory allocation, what strategy would be used? Or existing D3D12MA/VMA?
Probably something like VMA behind the scenes, but my sincere hope is to hide this from the app if I can get away with it.
Unsupported vertex formats
Removed! The remaining list honestly still feels like it could lose a bunch more formats and no one would miss them.
The trend is to unify these two; since supported GpuVertexFormat is just a subset of GpuPixelFormat
But these aren't the same thing at all...? Maybe I'm misunderstanding, but if/when we add compressed texture formats, we're definitely going to need these to be separate, as they won't just be packed arrays of raw pixels anymore.
(I'll go through the rest of your comments in the morning, just didn't want you to think I was ignoring you!)
But these aren't the same thing at all...? Maybe I'm misunderstanding, but if/when we add compressed texture formats, we're definitely going to need these to be separate, as they won't just be packed arrays of raw pixels anymore.
I've been giving more thought about it and I agree it makes sense to keep them separate. That makes it less error prone, otherwise you can feed a GpuPixelFormat that is not a valid GpuVertexFormat.
From a low-level perspective it makes sense since, like I said, GpuVertexFormat is a subset of GpuPixelFormat; it simplifies code (e.g. the same conversion routines that work for RGBA8_UNORM work for UCHAR4_UNORM) and allows for potential expansion in the future (though that has never actually happened, e.g. supporting RGBA565 or ASTC as a vertex format).
(I'll go through the rest of your comments in the morning, just didn't want you to think I was ignoring you!)
No prob and thanks for being considerate! :)
To revisit:
why separation of depth stencil state? This would work on metal
It looks like primitive topology needs to be specified in the PSO too in D3D12...my misreading of ID3D12GraphicsCommandList::IASetPrimitiveTopology made me think you could set whatever in the command buffer, but the PSO still wants to know if it's generally a point/line/triangle. :/
So we'll probably move SDL_GpuPrimitive into SDL_GpuPipelineDescription along with the depth stencil state, too.
- Generate tmp texture
- Load data into tmp texture
- Generate mipmaps into tmp
- Copy results into final texture
- Destroy tmp (or keep it cached for reuse)
I'm thinking we can do this in SDL_GpuGenerateMipmaps(), with the backend deciding whether a texture is ideal (if it doesn't have SHADERWRITE on D3D12, etc., it does the temporary texture proxy; otherwise it just does it directly on the existing texture)...it won't need a new SDL_GPUTEXUSAGE_MIPMAPGEN flag at all in this case, as SDL will handle this transparently.
If doing this transparently takes a performance hit, the primary use case would be loading textures without mipmaps from disk, which is generally considered a slow operation (and for large enough games, is probably tossed into a background thread).
But I might be oversimplifying things here...?
If doing this transparently takes a performance hit, the primary use case would be loading textures without mipmaps from disk, which is generally considered a slow operation (and for large enough games, is probably tossed into a background thread).
But I might be oversimplifying things here...?
No, that's a bad idea.
If I have a RENDERTARGET texture and want to call SDL_GpuGenerateMipmaps every frame, on D3D11 it will work as expected. But on D3D12 it will use the tmp path every frame because it's missing SHADERWRITE.
And even if you keep the MIPMAPGEN flag, you have to consider the issue of silent user error: users who call SDL_GpuGenerateMipmaps every frame without the MIPMAPGEN flag would end up on the slow tmp copy path.
And users who actually want the tmp copy path may end up adding the MIPMAPGEN flag on purpose, thinking it is necessary when it's not.
To fix that silent error the API needs to be more explicit, like SDL_GpuGenerateMipmaps( bool bUseTmpCopyPath ). The API would then give a perf warning if bUseTmpCopyPath == true and MIPMAPGEN is present, and give an error if bUseTmpCopyPath == false and MIPMAPGEN is absent.
Or, instead of adding bool bUseTmpCopyPath, a new flag MIPMAPGEN_ON_LOAD (or something like that) is added. The flag indicates SDL_GpuGenerateMipmaps can be called on the texture, but the copy path should be used.
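The explicit variant could validate flags against the requested path roughly like this (the names here are illustrative, not real SDL API):

```c
#include <assert.h>

/* Hypothetical subset of the usage flags and the validation results. */
enum { USAGE_MIPMAPGEN = 1 << 0 };
typedef enum { GEN_OK, GEN_PERF_WARNING, GEN_ERROR } GenMipsResult;

/* Make misuse loud instead of silently slow: warn when the flag forces
   slow texture layouts yet the copy path was requested anyway, and error
   when in-place generation is requested without the flag. */
GenMipsResult validateGenerateMipmaps( unsigned usageFlags, int useTmpCopyPath )
{
    const int hasMipmapGen = ( usageFlags & USAGE_MIPMAPGEN ) != 0;
    if( useTmpCopyPath && hasMipmapGen )
        return GEN_PERF_WARNING; /* flag is pointless overhead here */
    if( !useTmpCopyPath && !hasMipmapGen )
        return GEN_ERROR;        /* in-place generation needs the flag */
    return GEN_OK;
}
```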
Update: This is the problem OpenGL has. GL lets you call glGenerateMipmap on any texture at any time.
Very convenient.
But doing so means the driver must internally re-transform the texture to add the RENDERTARGET flag (this has been the cause of lots of driver bugs in the past), and once it's added it stays forever, which means you pay a hidden price just for calling the function; and only experts will tell you about this pitfall.
If I have a RENDERTARGET texture and want to call SDL_GpuGenerateMipmaps every frame, on D3D11 it will work as expected. But on D3D12 it will use the tmp path every frame because it's missing SHADERWRITE.
oh, I see. Okay, let me think about this more.
OpenGL
These are all good points about OpenGL, btw, and it will likely work much like you described...but also I'm trying to decide what the surface area looks like for "people that have a modern version of OpenGL and don't have any other options," and I suspect it's pretty small, outside of WebGL in current times and some smaller devices that expect you to use GLES3...which is not to minimize it, but focusing on Vulkan, Direct3D and Metal backends first and then making GL4/GLES3/WebGL2 limp along later seems like a reasonable strategy, at least on paper.
SDL_GpuPipelineDescription contains SDL_GpuColorAttachmentDescription, which contains a SDL_GpuTexture.
It doesn't, I have two unrelated structs with the same name, which is obviously a different problem. :)
Latest revision moves depth stencil details into SDL_GpuPipelineDescription and fixes SDL_GpuColorAttachmentDescription naming.
what the surface area looks like for "people that have a modern version of OpenGL and don't have any other options,"
Do you mean as in "My HW or drivers are too old to run anything better" ?
If that's the case then:
- On Windows, D3D11 is the best bet. You can use feature level 10.0 to target D3D10 hardware, which goes all the way back to 2006 with the GeForce 8000 series, AMD Radeon HD 2000 (2007), and Intel Series 4 (2008: Eaglelake's GMA X3100, GMA 4500, then HD Graphics).
- A few restrictions may apply to the features that can be used with FL 10.0. The good thing is that D3D11 will complain even on your most modern system if you just select FL 10.0.
- On Windows, GL is a terrible bet. The OpenGL drivers for most of this hardware are too buggy and haven't been updated in years. Unless you use ANGLE, which will translate OpenGL to D3D11 for you (or, in the future, Zink; but that requires Vulkan).
- Out of personal experience with Intel GPUs, the Intel HD 4000 is usually the oldest that doesn't cause a tremendous amount of pain due to either SW or HW bugs. Technically, if you target D3D11 FL 10.0, the older cards will work, just... expect bugs.
- On Linux there's the special case of the AMD Radeon HD 2000-6000 series and equivalent Intel GPUs, which actually have decent OpenGL 3.x drivers (or even "almost 4.x", thanks to extensions) but no Vulkan. These users typically use Windows instead of Linux; the pool is extremely small, but it does exist.
As for market share in general, I wouldn't dismiss it. These machines can't run the latest AAA games, and they don't appear in the Steam survey because that's not their market.
The people who use these machines are usually kids who play the least demanding games they can run (Minecraft, Roblox, Dota, LoL, WoW, CS:GO, Team Fortress, Overwatch, Fortnite) on their parent's "office" computer.
Fortunately the landscape has improved, with i3 CPUs becoming more common in offices; those come bundled with a reasonable Intel iGPU (in terms of API/driver support).
@darksylinc what about d3d12? For my game engine I went fully bindless with vulkan and d3d12, would vulkan 1.2 as min spec work?
what about d3d12? For my game engine I went fully bindless with vulkan and d3d12, would vulkan 1.2 as min spec work?
What about it? I don't understand the question, and it feels a little off topic too.
Is D3D12 viable as a better choice over D3D11? And which Vulkan minimum spec is planned to be supported here: Vulkan 1.1, 1.2 or 1.3?
Is D3D12 viable as a better choice over D3D11? And which Vulkan minimum spec is planned to be supported here: Vulkan 1.1, 1.2 or 1.3?
I'm starting with D3D12, but my expectation is that I will put a D3D11 backend in as well. I almost certainly won't go lower than that; D3D11 was available starting with Windows 7 (and apparently backported to Windows Vista!), and that feels like more than enough legacy support.
I don't think we'll need anything beyond Vulkan 1.0, at least at first, as far as I can tell, but there's a lot of small details I expect to stumble over as I go, so we'll see what pops up that needs 1.1 or whatnot.
Anyhow, the initial goals are D3D12, Metal, and Vulkan, and then things like GL4/GLES3/WebGPU can be considered (and possibly abandoned).
Pushed changes that resolve most of the discussed things; I think the only thing still unresolved is what to do about generating mipmaps, which I'm still considering.
OpenGL does not support samplers being separate from textures at the shader level
Sorting through this feedback again, and I'm just making sure I'm clear on this part: Is this problem not solved by GL_ARB_sampler_objects?
It's not. That partially solves the problem on the C++ side, but not on the shader side
Ah, I see...HLSL lets you do myTexture.Sample(mySampler, texcoords) but GLSL only offers texture(myTextureUnit, texcoords). :(
Opinions welcome on the current plans for vertex shader syntax: icculus/SDL_shader_tools#3
Re: depth buffer formats
Just FYI: Depth24_Stencil8 isn't available in Metal on iOS (it is on macOS, though). Both platforms do support Depth16, Depth32Float, Stencil8, and the convenience format Depth32_Stencil8.
Maybe SDL GPU should add Depth32Float and Depth32Float_Stencil8, assuming those are also supported on DX12 and Vulkan..
(Just bumping this comment over here to have discussion stay in one place. --ryan.)
I'm not a huge fan of manually exposing Fences as a client-side resource. Since their intended purpose is to query command buffer status and wait on command buffer completion, I think these should be command buffer APIs (
SDL_QueryCommandBufferStatus/SDL_WaitForCommandBuffer), with the various backends handling the fences themselves. Making fence submission optional on the client side also has the side effect of making a Vulkan implementation significantly more complicated, since in most Vulkan renderers, command buffers always need a fence to manage and clean up internal state. This means that the backend would need to manage fences internally anyway, and since Vulkan only allows you to submit one fence per command buffer submission, internal cleanup and user-side notifications would have to be conflated. This seems unnecessarily complicated.
Originally posted by @TheSpydog in https://github.com/libsdl-org/SDL/issues/7067#issuecomment-1382251359
I had to dig through my source code to see how this is handled, and realized TheSpydog is right, though with some remarks to make.
You only get one VkFence per command buffer. And Metal works in a similar way too (they just call it a semaphore, which has nothing to do with VkSemaphore).
But I would suggest a slight separation, having fences still exposed to user:
SDL_Fence fence = SDL_GetCommandBufferFence( cmdBuffer );
SDL_WaitForFence( fence );
That is, still expose fences to users, but make it explicit that fences come from Cmd Buffers (instead of the user being able to create them on a whim at random).
It also makes explicit that waiting for a fence requires flushing a cmd buffer, which in certain cases could have nasty side effects. If you fully separate fences from cmd buffers, hiding this fact will leave users in a gigantic world of pain.
Why?
It's a matter of separation of concepts. A 'fence' is something we are waiting for. A 'command buffer' is a huge chunk of work, potentially consuming a lot of memory, we may not even care about.
In OgreNext we have the following pattern for data we must download GPU -> CPU when highly accurate fence tracking is needed:
vkCmdCopyImageToBuffer( mQueue->mCurrentCmdBuffer, srcTextureVk->getFinalTextureName(),
                        VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL, dstBuffer, 1u, &region );
if( accurateTracking )
{
    mAccurateFence = mQueue->acquireCurrentFence();
    // Flush now for accuracy with downloads.
    mQueue->commitAndNextCommandBuffer();
}
We don't really care about cmd buffers. We care that the vkCmdCopyImageToBuffer is done.
In fact, the cmd buffer may already start being recycled while the VkFence is still not recycled (once we're done with the fence, we recycle it too).
Detaching Cmd buffers from fences allows recycling to be separate.
What's an accurate fence?
In OgreNext, when you need to download data from GPU to CPU, you could do 3 things:
- Initiate download, 3 frames later (triple buffer) you access the downloaded data. This is protected by the frame's VkFence.
- Initiate download, do some mild work, check on the VkFence; if it's not ready, do more work. Check again; if we're out of work to do, stall on the VkFence.
- Initiate download, stall
Accurate fences are for dealing with scenario 2: the download is quite urgent, but we can do some work we have left rather than stalling.
What about command buffers without fences?
The schemes TheSpydog or I propose assume there is one VkFence per Cmd Buffer, while Vulkan has no such requirement (you can submit a cmd buffer without any fence).
Well, you could either assume all cmd buffers need a VkFence (in OgreNext we do that, because rendering engines are complex and for various reasons you always end up needing that fence sooner or later), or add an API, e.g. SDL_SetHasFence( cmdBuffer, true ).
EDIT: I suspect disabling a cmd buffer's fence would be a bad idea. In order to know whether a cmd buffer is done (and thus recycle or destroy it), you need a fence. If you submit two cmd buffers in a row, CMD A is not guaranteed to finish before CMD B does, unless you've used a VkSemaphore to explicitly state that CMD B depends on A.
Thus if you submit cmd buffer A without fences and without semaphores, and then B with a fence, you have no way to query when A is over (unless you submit a new cmd C that issues a full barrier, or call vkDeviceWaitIdle).
But I would suggest a slight separation, having fences still exposed to user:
SDL_Fence fence = SDL_GetCommandBufferFence( cmdBuffer ); SDL_WaitForFence( fence );
The problem with this approach is that you are still exposing the user to fence management: now every time they submit a command buffer, they have to wait for it to finish and then free the fence that the submit returned, or they will have a memory leak and the Vulkan validation layer will complain loudly. It would be a major headache to do this without blocking the main thread.
Since waiting for a fence is essentially the same thing as waiting for the command buffer to finish working, I don't see any good reason to force the user to think about fences. I've designed two systems this way and it's never ever been a problem. Some console APIs don't even have fences!
While I'm here... I'm critical of referring to getting the presentation texture as GetBackbuffer. This is maybe pedantic, but I think the term "backbuffer" carries an implication that the memory will be stable across frames, which is not true in basically any modern API.
What I've done in my own API is have a call AcquireSwapchainTexture which takes a command buffer and returns the swapchain texture - if acquiring the swapchain texture is successful, the command buffer automatically sets up the presentation structure when the command buffer is submitted. If acquiring the swapchain texture was unsuccessful, the call returns NULL to let the user know. Doing it this way has some nice structural benefits for the Vulkan renderer at least, because we can automatically set up a semaphore to wait on the submitted command buffer being finished before presenting the swapchain image. If there's a problem with this approach, I haven't seen it yet.
Since waiting for a fence is essentially the same thing as waiting for the command buffer to finish working, I don't see any good reason to force the user to think about fences
No, it's not the same thing.
A command buffer can block literally hundreds of megabytes from being released. A fence does not.
I've run into this problem when building cubemap probes for reflections. A lot of temporary textures and buffers are created, and having a small array of 3x3 cubemaps with 3 PSSM splits means 6 faces x 9 cubemaps x 4 passes per face = 216 passes.
That's a lot of drawcall information stored in the cmd buffer(s). But it's not just that, you can't release tmp textures & buffers either until the cmd buffers have been reset/destroyed. Everything depends on those cmd buffers.
Carefully and properly flushing and releasing command buffers is the difference between running out of memory and generating these probes successfully. And you can't do that if waiting on a fence forces the cmd buffer to stay alive.
You are right that my example code was flawed though, as the correct code would be:
SDL_Fence fence = SDL_GetCommandBufferFenceAcquire( cmdBuffer ); // increases ref count on fence
SDL_WaitForFence( fence );
SDL_FenceRelease( fence ); // We are done with the fence, decrease ref count
Unfortunately, you can't hide the fences unless the use cases are trivial.
What you can do, though, is design the API so that fences are acquired from a cmd buffer, instead of being created arbitrarily.
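A sketch of that acquire/release pattern with refcounting, so a fence can outlive its command buffer (all types and functions here are hypothetical, not real SDL API):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted fence: the cmd buffer owns one reference, and
   SDL_GetCommandBufferFenceAcquire hands out additional ones. */
typedef struct SDL_Fence
{
    int refCount;
} SDL_Fence;

typedef struct SDL_CommandBuffer
{
    SDL_Fence *fence; /* created with refCount 1, owned by the cmd buffer */
} SDL_CommandBuffer;

SDL_Fence *SDL_GetCommandBufferFenceAcquire( SDL_CommandBuffer *cb )
{
    ++cb->fence->refCount;
    return cb->fence;
}

void SDL_FenceRelease( SDL_Fence *fence )
{
    if( --fence->refCount == 0 )
        free( fence );
}

/* Recycling the cmd buffer drops only *its* reference, so the (potentially
   huge) cmd-buffer state can be reclaimed while the tiny fence stays alive
   for anyone still waiting on it. */
void recycleCommandBuffer( SDL_CommandBuffer *cb )
{
    SDL_FenceRelease( cb->fence );
    cb->fence = NULL;
}
```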
I am extremely confused. You can't safely release data related to commands in a command buffer or the command buffer itself unless the command buffer has completed, and the typical way to know if a Vulkan command buffer has completed execution is to check if its associated submission fence is signaled. Am I missing something here? I know there are ways to do fine-grained command checks using vkEvent but that seems unrelated to what we are discussing here.
I am extremely confused. You can't safely release data related to commands in a command buffer or the command buffer itself unless the command buffer has completed, and the typical way to know if a Vulkan command buffer has completed execution is to check if its associated submission fence is signaled.
That is correct. But the reasons to wait on a fence are varied.
Yes, the primary reason to wait on a fence is to know whether its associated cmd buffer is done and can therefore be recycled. But a secondary reason is to know whether a particular command (like a vkCmdCopy*) inside that command buffer has finished.
If that command has finished, it's safe to access the mapped region from the CPU. Hence the user may want to hold on to a VkFence for longer than they want to hold on to the VkCommandBuffer.
Thinking out loud:
Perhaps it may be possible to not expose an SDL_Fence if an SDL_CommandBuffer internally recycles its VkCommandBuffer (which means the same SDL_CommandBuffer could be internally associated with many VkCommandBuffer across the lifetime of an app).
When SDL determines a SDL_CommandBuffer is over, it recycles and unsets its internal VkCommandBuffer, but still keeps its VkFence alive for wait/querying purposes.
That would make debugging harder, though, when things go wrong (e.g. a seemingly impossible crash where the SDL_CommandBuffer ptr is the same, but the user fails to realize the internal VkCommandBuffer ptr went stale, because this implementation detail is hidden from them).