
Representation of the proprioceptive observation space and action space (joint position/velocity, Cartesian position/velocity)


Hi, thanks for your excellent work! I have some questions about the representation of the proprioceptive observation space and the action space during training.

  1. In the paper, you mention using the robot's joint positions as the proprioceptive state $q_t$. During the pre-training stage, are the actions $A_t$ output by the model also joint positions?
  2. How do you handle the differences in action space and proprioceptive observation space between the OXE and $\pi$ datasets during pre-training? As far as I know, the datasets in the OXE Magic Soup all use Cartesian positions (i.e., delta end-effector poses) as the action space. If the $\pi$ dataset uses joint positions as the action space, how are these two representations unified during pre-training? Similarly, the proprioceptive state in some OXE Magic Soup datasets is given only as the Cartesian pose of the end effector in the robot's base frame. How do you convert it into joint positions? (As far as I know, the end-effector pose can be converted to joint positions using inverse kinematics, but that process often has multiple solutions.) To make these questions concrete, I put two toy sketches after this list.
  3. The proprioceptive state $q_t$ and action $A_t$ representations differ between the post-training and pre-training stages. In the code you provided, I found that Droid uses joint velocities to represent the action $A_t$, while Libero uses the end-effector pose to represent the proprioceptive state $q_t$. Both of these representations differ from the ones used in the pre-training phase described in the paper. Could you please explain why, and whether you think this affects the performance of the model?
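To illustrate what I mean by "unifying" the two action representations, here is a minimal, purely hypothetical sketch of zero-padding differently sized action vectors into one fixed-width vector. The 32-dimensional width, the `pad_to_dim` helper, and the example values are my own assumptions for illustration, not something taken from openpi:

```python
import numpy as np

ACTION_DIM = 32  # hypothetical shared action width used by the model


def pad_to_dim(vec: np.ndarray, dim: int = ACTION_DIM) -> np.ndarray:
    """Zero-pad a per-robot action/state vector to the shared model dimension."""
    padded = np.zeros(dim, dtype=np.float32)
    padded[: vec.shape[0]] = vec
    return padded


# A 7-D delta end-effector action (xyz + rpy + gripper), as in many OXE datasets.
oxe_action = np.array([0.01, -0.02, 0.0, 0.0, 0.0, 0.05, 1.0], dtype=np.float32)

# An 8-D joint-position action (7 joints + gripper), as in a joint-space dataset.
joint_action = np.array([0.1, -0.3, 0.2, 1.5, 0.0, 0.4, -0.1, 0.0], dtype=np.float32)

# Both map into the same fixed-width vector; the model would then have to learn,
# per embodiment, which dimensions are meaningful.
batch = np.stack([pad_to_dim(oxe_action), pad_to_dim(joint_action)])
print(batch.shape)  # (2, 32)
```

Is this padding-into-a-shared-vector approach roughly what happens during pre-training, or is there an explicit conversion between the two representations?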
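And here is a toy sketch of the only disambiguation strategy I can think of for the multiple-IK-solution problem: pick the solution closest to the previous joint configuration so the converted joint trajectory stays continuous. The `solve_ik` function below is just a stand-in that returns made-up configurations so the snippet runs; it is not a real solver:

```python
import numpy as np


def solve_ik(ee_pose: np.ndarray) -> list[np.ndarray]:
    """Stand-in for a real IK solver that returns every valid joint solution.

    Here it returns two made-up 7-DoF configurations so the example runs;
    a real pipeline would call an analytic or numerical solver instead.
    """
    return [
        np.array([0.1, -0.5, 0.2, 1.4, 0.0, 0.3, 0.0]),
        np.array([2.9, 0.5, -2.9, 1.4, 3.1, 0.3, 3.1]),  # "elbow-flipped" branch
    ]


def closest_ik_solution(ee_pose: np.ndarray, prev_q: np.ndarray) -> np.ndarray:
    """Resolve IK multiplicity by choosing the solution nearest the previous
    joint configuration, keeping the converted trajectory continuous."""
    return min(solve_ik(ee_pose), key=lambda q: float(np.linalg.norm(q - prev_q)))


prev_q = np.zeros(7)
ee_pose = np.eye(4)  # dummy pose; only the selection logic matters here
print(closest_ik_solution(ee_pose, prev_q))
```

Is something along these lines what you do for the OXE datasets whose proprioceptive states are only available as end-effector poses, or do you avoid the conversion entirely?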

HaomingSong · Feb 13, 2025