Question about the paper: computing $\hat{f}(p)$
Very impressive work! I have a question about how $\hat{f}(p)$ is computed in your paper.
As I understand it, $\hat{f}(p)$ is the difference between the depth of a point $p$ and the depth-map value along the ray that $p$ lies on, for the current training pose. However, this depth difference is not the definition of the SDF, and there is also a case that this operation cannot handle.
Suppose there are two Gaussians, $g_1$ and $g_2$. The SDF of $p$ should then be the red line, right? But for a pose $O$, $\hat{f}(p)$ as described in the paper would be the green line.
Is there something I am missing? Thanks again for the great work!
Hello @SeanGuo063,
Thank you so much for your nice words!
You're right: in practice, $\hat{f}$ is just an estimator of the real SDF, so there may be cases where $\hat{f}$ is not accurate, such as the one you mention. However, in Gaussian Splatting, depth maps are computed using Gaussian splats facing the camera (which is equivalent to saying the surface is made of small surface elements facing the camera), so using this estimator makes sense in the end. You should really just see this SDF loss as a regularization tool: it helps the 3D Gaussian blobs not only align with the real surface but also face the cameras on average, which regularizes the background better than the simpler density loss we describe in the paper.
Please let me copy/paste part of my answer to a previous issue about a similar question:
Question: For Fig. 5, why can we compute $f(p)$ as the difference between the depth of $p$'s projection and the true depth of $p$?
The estimator $\hat{f}(p)$ is a rough approximation of what would be the real SDF associated with the current scene. But this approximation makes sense in the context of "splatted" depth maps. Let me explain.
For a camera $c$ during optimization, we compute a depth map using the Gaussian Splatting rasterizer. This depth map is not perfectly accurate, as Gaussians are converted to flat splats facing the camera during rasterization. Still, it is a good approximation: we assume that the depth map describes the surface of the scene well, as seen by the current camera.
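As a toy illustration of what a "splatted" depth value means (this is not the actual rasterizer code, just a minimal sketch in plain PyTorch; real implementations may also normalize by the accumulated opacity), a per-pixel depth can be alpha-composited from the splats covering that pixel, sorted front to back, exactly like colors are:

```python
import torch

def composite_depth(splat_depths, splat_alphas):
    """Alpha-composite a per-pixel depth from the splats covering this
    pixel, sorted front to back, the same way colors are blended.

    splat_depths: (N,) depth of each splat at this pixel
    splat_alphas: (N,) opacity of each splat at this pixel
    """
    depth = 0.0
    transmittance = 1.0  # light not yet absorbed by closer splats
    for d, a in zip(splat_depths.tolist(), splat_alphas.tolist()):
        depth += transmittance * a * d
        transmittance *= 1.0 - a
    return depth

# A nearly opaque splat at depth 2.0 dominates the blended value:
print(composite_depth(torch.tensor([2.0, 5.0]), torch.tensor([0.9, 0.8])))  # 2.2
```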
Let's consider a point $p$ sampled using the product of all Gaussian distributions (for Gaussians inside the field of view of $c$); since most Gaussians are very small, this point $p$ is likely to be located near the real surface of the scene. Consequently, the SDF value of $p$, i.e. its distance to the surface, should be equal to the distance between $p$ and the surface observed in the depth map, i.e. the distance between $p$ and the surface point closest to $p$; let's call this point $q$. To approximate this distance $\|p-q\|$, we choose as $q$ the 3D point that corresponds to the projection of $p$ in the depth map.
Why do we do that? Because the depth map is "splatted", the surface observed in the depth map is approximately composed of small surface elements facing the camera (in practice the Gaussian functions smooth things out, but still). Therefore, the point $q$, which is the surface point closest to $p$, is likely to be the point located on the same ray/line of sight as $p$, i.e. the point that has the same projection as $p$ in the depth map.
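To make this concrete, here is a minimal sketch of the estimator (the names are illustrative, not the actual code from this repo; I assume a pinhole camera with intrinsics $K$, a $4 \times 4$ world-to-camera transform, a nearest-neighbor depth lookup, and a sign convention where points in front of the observed surface get positive values):

```python
import torch

def estimate_sdf(p_world, depth_map, K, world_to_cam):
    """Rough SDF estimate f_hat(p): rendered depth at the projection
    of p, minus the depth of p itself (the sign convention is a choice;
    here points in front of the observed surface get positive values).

    p_world:      (N, 3) sampled 3D points in world coordinates
    depth_map:    (H, W) depth rendered from the current camera
    K:            (3, 3) pinhole intrinsics
    world_to_cam: (4, 4) rigid world-to-camera transform
    """
    # Move the points to camera coordinates.
    ones = torch.ones_like(p_world[:, :1])
    p_cam = (world_to_cam @ torch.cat([p_world, ones], dim=1).T).T[:, :3]
    z = p_cam[:, 2]  # depth of p along the optical axis

    # Project to pixel coordinates (nearest-neighbor depth lookup;
    # bilinear sampling would be smoother).
    uv = (K @ p_cam.T).T
    u = (uv[:, 0] / uv[:, 2]).round().long().clamp(0, depth_map.shape[1] - 1)
    v = (uv[:, 1] / uv[:, 2]).round().long().clamp(0, depth_map.shape[0] - 1)
    surface_depth = depth_map[v, u]  # depth of q, the projection of p

    return surface_depth - z
```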
This last point is a little tricky, but you should really just see all this as a regularization tool on the density that also allows involving depth regularization (since we compute $\hat{f}$ using the depth). It also encourages the Gaussians not only to align with the surface, but also to face the camera poses on average, which is a useful prior for regularizing the background.
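As a rough sketch of that "regularization tool" view (again with illustrative names: the sampling here is simplified to a per-Gaussian scheme rather than the paper's exact one, and the target SDF, which the paper derives from the Gaussians' density, is abstracted as a placeholder `target_sdf_fn`):

```python
import torch

def sdf_regularization(means, scales, depth_map, K, world_to_cam,
                       target_sdf_fn, n_samples=10000):
    """Skeleton of an SDF regularization term (illustrative only).

    means:  (G, 3) Gaussian centers;  scales: (G, 3) per-axis std devs
    target_sdf_fn: placeholder for the target SDF value, derived from
    the Gaussians' density in the paper.
    """
    # Sample points near the Gaussians, so samples concentrate close
    # to the current surface estimate.
    idx = torch.randint(len(means), (n_samples,))
    p = means[idx] + torch.randn(n_samples, 3) * scales[idx]

    # Compare the depth-based estimate to the target and penalize the gap.
    f_hat = estimate_sdf(p, depth_map, K, world_to_cam)  # sketch above
    return (f_hat - target_sdf_fn(p)).abs().mean()
```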
I hope this answers your question!