
Different camera intrinsics in 3DOD and raw

Open Yangr116 opened this issue 1 year ago • 16 comments

[image attachment] Why are the camera intrinsics of the same scene different in the 3DOD and the raw data?

Yangr116 avatar Jan 13 '25 14:01 Yangr116

This is most likely because:

  1. The 3DOD data uses camera intrinsics as estimated by the device's calibration and the algorithms that run during the capture.
  2. The raw intrinsics are determined by a separate post-hoc global optimization process that optimizes the camera positions, camera intrinsics, etc. (i.e., a bundle adjustment). Arguably, these are the most accurate in the sense that they are the most consistent from a global perspective.

I'm not sure I would sweat this too much; the camera even has mild lens distortion, which I don't think was accounted for in the release. It's something we're addressing in the CA-1M update.

jlazarow avatar Jan 27 '25 17:01 jlazarow

How would this difference in camera intrinsics between 3DOD and raw affect using rectify_im.py to calculate the correct sky orientation of the videos?

I tried to use, and then modify, rectify_im.py to calculate the orientation of the videos (following the suggestions in #47 and #51), but it did not work.

anjaligupta1104 avatar Jan 28 '25 01:01 anjaligupta1104

It shouldn't; the orientation is purely determined by the pose. Can you give me some video IDs to check? It should be noted that decide_pose is deciding the pose of a single frame. This doesn't necessarily mean it always corresponds to some holistic sense of the orientation of the video. For instance, ARKit tracking may not be accurate for the first few frames of the video, and the orientation of the device itself could change (though it rarely should in these recordings). If you need to describe a single video, I would recommend taking a majority vote over the per-frame orientations (e.g., with np.bincount), as in the sketch below. If that still doesn't work, let me know the video IDs and I will check them out.
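A minimal sketch of that per-video majority vote, assuming decide_pose (or an equivalent) has already produced one small integer orientation label per frame; the 0–3 label encoding used here is an assumption for illustration:

```python
import numpy as np

def video_orientation(per_frame_labels):
    """Majority vote over per-frame orientation labels.

    per_frame_labels: one small non-negative int per frame, e.g. the output of
    decide_pose (the 0-3 label encoding is assumed here for illustration).
    """
    labels = np.asarray(list(per_frame_labels), dtype=int)
    counts = np.bincount(labels, minlength=4)  # votes per orientation label
    return int(np.argmax(counts))

# Example: tracking is off for the first two frames, but the majority of
# frames agree on label 2, so the whole video is labeled 2.
print(video_orientation([0, 0, 2, 2, 2, 2, 2, 2]))  # -> 2
```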

jlazarow avatar Jan 28 '25 02:01 jlazarow

Yes, taking a majority vote is exactly what I did. Unfortunately, I noticed that this did not work for 40/150 of the validation scenes (after manual verification). I had plotted the z-vector components of some videos (I apologize, I cannot remember or track down the specific video IDs in this case). As you can see, they tend to be the same (with z_y near 1 and z_x near 0), so both are classified with an orientation of 'down', yet in reality one of the videos is already upright.

Thank you!

rotation_matches.csv rotation_mismatches.csv

[image: plot of z-vector components]

anjaligupta1104 avatar Jan 28 '25 02:01 anjaligupta1104

Sounds good, but not sure the attachments worked?

jlazarow avatar Jan 28 '25 05:01 jlazarow

Edited comment. Please lmk if the updated attachments don't work!

anjaligupta1104 avatar Jan 28 '25 17:01 anjaligupta1104

Thanks. How are you doing the manual verification? For instance, I verified one of your mismatches: 45662987 looks fine.

The orientation is consistently reported as being upside down, and looking at the corresponding vga_wide.zip seems to confirm that it is upside down.

Is it possible you're looking at frames from the raw MOVs and not the explicit image dumps?

jlazarow avatar Jan 28 '25 18:01 jlazarow

Ah, yes, manual verification was done by opening and looking at the raw MOVs. I wasn't aware vga_wide and mov didn't match.

Do you have a recommendation for calculating the orientation of the raw MOVs? For the time being we used GPT-4o to classify frames, but I'm curious about a geometric / foolproof approach.

anjaligupta1104 avatar Jan 28 '25 20:01 anjaligupta1104

The orientations do match; however, the application you're using (e.g., VLC) is probably detecting some stored metadata and automatically orienting the video in a more visually pleasing manner. QuickTime generally shows the encoded orientation and does not apply the metadata orientation. If you have ffmpeg/ffplay, you can ask it to play the file without applying the metadata:

ffplay [video_name].mov -noautorotate

If you really want the presentation orientation of the MOVs (which it sounds like you want), you can also ask for this (https://superuser.com/questions/660505/find-out-rotation-metadata-from-video-in-ffmpeg):

ffprobe -v quiet -select_streams v:0 -show_streams [video_name].mov | grep -i rotation=
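
For scripting, a hedged sketch of reading that same presentation rotation from Python via ffprobe's JSON output; the exact key names (a rotate stream tag in older ffprobe versions, a side_data_list rotation entry in newer ones) depend on the ffprobe build and should be double-checked:

```python
import json
import subprocess

def mov_rotation_degrees(path):
    """Return the stored presentation rotation of a video (0 if no metadata)."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-select_streams", "v:0", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    stream = json.loads(result.stdout)["streams"][0]
    # Older ffprobe versions expose rotation as a stream tag.
    rotate_tag = stream.get("tags", {}).get("rotate")
    if rotate_tag is not None:
        return int(rotate_tag)
    # Newer versions report it in the display-matrix side data.
    for side_data in stream.get("side_data_list", []):
        if "rotation" in side_data:
            return int(side_data["rotation"])
    return 0

# e.g. mov_rotation_degrees("45662987.mov")  # file name here is hypothetical
```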

jlazarow avatar Jan 28 '25 20:01 jlazarow

Thank you for your reply. I would like to know why some images are rotated 90/180/270 degrees clockwise. For these rotated images, how are their 3D bounding boxes annotated? Are only the images rotated, or are other parameters such as camera intrinsics and camera extrinsics (pose) also rotated? How can I convert them into a normal view?

Yangr116 avatar Feb 14 '25 16:02 Yangr116

Hi @Yangr116, thanks for your question. Regarding "how are their 3D bounding boxes annotated?": the ground truth 3D bounding boxes are actually drawn on the 3D room-level mesh. The boxes can then be projected back to the images using their respective camera parameters, as in the sketch below.
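
To make that last sentence concrete, here is a rough sketch of the projection under a plain pinhole model. The assumption that the per-frame pose is a 4x4 camera-to-world matrix (so it is inverted first) and the omission of lens distortion are simplifications for illustration, not a description of the release scripts:

```python
import numpy as np

def project_world_points(points_world, camera_to_world, fx, fy, cx, cy):
    """Project Nx3 world-space points (e.g., the 8 corners of an annotated
    3D box) into pixel coordinates for one frame.

    camera_to_world: 4x4 pose for that frame (assumed camera-to-world;
    skip the inverse if your pose is already world-to-camera).
    Returns Nx2 pixel coordinates and the camera-space depths.
    """
    points_world = np.asarray(points_world, dtype=float)
    world_to_camera = np.linalg.inv(camera_to_world)
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (world_to_camera @ pts_h.T).T[:, :3]
    z = pts_cam[:, 2]  # points with z <= 0 are behind the camera
    u = fx * pts_cam[:, 0] / z + cx
    v = fy * pts_cam[:, 1] / z + cy
    return np.stack([u, v], axis=1), z
```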

PeterZheFu avatar Feb 14 '25 19:02 PeterZheFu

To add, @Yangr116: the images themselves are not rotated, but the device (i.e., the iPad) is rotated, which gives the sense that the images are not upright (even if the device rotates, the camera on the device cannot rotate independently of it). But as @PeterZheFu says, this rotation is incorporated into the provided pose for that camera frame.

jlazarow avatar Feb 14 '25 23:02 jlazarow

Thanks for your answer! @PeterZheFu @jlazarow

I would like to confirm one thing: "the ground truth 3D bounding boxes are actually drawn on the 3D room-level mesh" and "this rotation itself will be incorporated into the provided pose for that camera frame" together mean that the 3D objects are not rotated. Correct? In other words, if I want to convert the images into a standard view (without rotation), I need to revise the projection matrix of the bboxes (from world coordinates into camera coordinates) and the camera intrinsics.

In addition, I would like to know how cubifyanything handles these rotated images.

Yangr116 avatar Feb 15 '25 07:02 Yangr116

Correct. Think about it this way: in order to build the mesh, we could backproject each image + depth map + pose into a world point cloud and then mesh this. The inertial sensors on the iPad ensure that this world coordinate frame is upright (i.e., aligned to the ground). So, no matter what the actual rotation of the device was in each image, the pose (well, its inverse) includes the rotation needed to take the backprojected points from that (possibly rotated) image and ensure they are aligned with gravity. The 3D boxes are then annotated in this upright world and only include yaw rotations (around the gravity axis).
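
A hedged sketch of that backprojection step (image + depth map + pose into a gravity-aligned world point cloud), again under a pinhole model with a 4x4 camera-to-world pose; the depth units and axis conventions below are assumptions and would need checking against the actual release:

```python
import numpy as np

def backproject_depth_to_world(depth, camera_to_world, fx, fy, cx, cy):
    """Lift an HxW depth map into world coordinates.

    depth: HxW array of metric depths (assumed meters, 0 = invalid).
    camera_to_world: 4x4 pose; because the world frame is gravity-aligned,
    the resulting points are upright no matter how the device was rotated.
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # Nx4 homogeneous
    return (camera_to_world @ pts_cam.T).T[:, :3]
```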

For Cubify Anything, the model only ever sees upright images (we pre-rotate them to align the gravity axis upwards). Similarly, the CA-1M data will only be released in this manner to avoid some of the issues seen in the ARKitScenes release.

jlazarow avatar Feb 15 '25 19:02 jlazarow

Thanks a lot!!!! May I kindly inquire about the timeline for the open-source release of the CA-1M? 

Yangr116 avatar Feb 19 '25 08:02 Yangr116