Embodied Visuomotor Representation


Levi Burner
Cornelia Fermüller
Yiannis Aloimonos
Perception and Robotics Group
University of Maryland, College Park


Abstract

Imagine sitting at your desk, looking at various objects on it. While you do not know their exact distances from your eye in meters, you can reach out and touch them. Rather than relying on an externally defined unit, your sense of distance is inherently tied to your actions' effect on your embodiment. In contrast, conventional robotics relies on precise calibration to external units, through which separate vision and control processes communicate. This necessitates highly engineered, expensive systems that cannot be easily reconfigured.

To address this, we introduce Embodied Visuomotor Representation, a methodology through which robots infer distance in a unit implied by their actions, without depending on calibrated 3D sensors or known physical models. With it, we demonstrate that a robot with no prior knowledge of its size, environmental scale, or strength can learn to touch and clear obstacles within seconds of operation. Likewise, in simulation, an agent without knowledge of its mass or strength can jump across a gap of unknown size after a few test oscillations. These behaviors mirror natural strategies observed in bees and gerbils, which also operate without calibration to an external unit, and highlight the potential of action-driven perception in robotics.
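To make the gap-crossing idea concrete, here is a minimal sketch under toy assumptions: 1D dynamics $\ddot{x} = b\,u$ with a combined gain $b$ (strength over mass) unknown to the agent, and uncalibrated vision that reports only the ratio of the agent's test displacement to the gap width. All variable names, the constant-thrust jump, and the 10% margin are illustrative choices, not the controller from the paper.

```python
import numpy as np

# Toy 1D model: x_ddot = b * u, with the gain b unknown to the agent.
b = 5.2      # strength / mass; used only to simulate the physics
gap = 1.8    # gap size in meters; also unknown to the agent
dt = 1e-3

# Test motion: apply a known control burst and watch the result.
# Uncalibrated vision reports rho = displacement / gap_width; both are
# visible in the image, so their ratio needs no metric calibration.
t = np.arange(0.0, 0.3, dt)
u = np.ones_like(t)                  # known test control signal
v_over_b = np.cumsum(u) * dt         # velocity / b  (integrate u once)
d_over_b = np.cumsum(v_over_b) * dt  # displacement / b  (integrate again)
disp = b * d_over_b[-1]              # what the physics actually produced
rho = disp / gap                     # what vision reports (scale-free)

# Embodied estimate: the gap's size in the unit implied by u.
gap_over_b = d_over_b[-1] / rho      # equals gap / b; no meters involved

# Plan the jump: constant thrust u_j for time T_j covers a distance of
# b * u_j * T_j**2 / 2, so choose T_j from gap_over_b (10% margin).
u_j = 2.0
T_j = np.sqrt(2.0 * 1.1 * gap_over_b / u_j)

achieved = b * u_j * T_j**2 / 2.0
print(f"gap: {gap:.2f} m, planned displacement: {achieved:.2f} m")
```

The key point of the sketch is that `gap_over_b` is computed purely from the control signal and a visual ratio; the unknown $b$ never appears on the agent's side of the computation.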

[arXiv]
[Code]




(A): The classic sense-plan-act architecture used in robotics, assuming visual-inertial odometry (VIO) is used for state estimation. Stability depends on calibrated sensors, such as an IMU, that provide accurate knowledge of the state in an external scale.

(B): An architecture based on Embodied Visuomotor Representation. Compared to sense-plan-act, the embodied approach includes an additional internal feedback connection (red arrow) containing the control signal $u$. The units of $u$ are implied by the unknown gain $b$ and the dynamics mapping control to state, expressed in any scale, including the meter. Embodied Visuomotor Representation leverages the position-to-scale $\Phi_W$, obtained from vision, together with $u$, to determine a state estimate $\hat{x}/b$ in the embodied scale of $u$. Notably, the unknown $b$ cancels in the closed-loop system, enabling stable control without calibrated sensors. Direct methods such as Tau Theory, Direct Optical Flow Regulation, and Image-Based Visual Servoing also avoid dependence on calibrated sensors by using purely visual cues (e.g., time-to-contact, optical flow, or tracked image features) for feedback. However, with few exceptions, the stability of such methods depends on tuning the control law for the expected scene distances and system velocities.
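The cancellation of $b$ can be illustrated in a few lines. The sketch below is our own toy example, not code from the paper: it assumes a first-order model $\dot{x} = b\,u$ and that vision supplies only the scale-free ratio $r(t) = x(t)/x(0)$ (e.g., from the inverse of an object's image size), with $r$ loosely playing the role the caption assigns to $\Phi_W$. The probing phase and the gain $k$ are hypothetical choices.

```python
# Toy approach task: distance-to-target x with dynamics x_dot = b * u,
# where the gain b is unknown to the controller.
b, x0, dt, k = 3.7, 2.4, 1e-3, 2.0

x = x0
U = 0.0       # running integral of the control signal u
u = -1.0      # brief probing action to excite the visual ratio

for step in range(int(6.0 / dt)):
    t = step * dt
    r = x / x0                             # scale-free ratio from vision

    # Since x/b = x0/b + integral(u) and r = x/x0, algebra gives
    # x0/b = U / (r - 1), hence the state estimate in the scale of u:
    if t > 0.2 and abs(r - 1.0) > 1e-3:
        x_hat_over_b = r * U / (r - 1.0)   # equals x / b
        u = -k * x_hat_over_b              # closed loop: x_dot = b*u = -k*x

    U += u * dt
    x += b * u * dt                        # simulate the true physics

print(f"distance after 6 s: {x:.5f} (started at {x0:.1f})")
```

Here `x_hat_over_b` equals $x/b$ exactly because the discrete updates of `x` and `U` share the same $u$; the controller never recovers $b$ itself, only the state expressed in the unit $b$ implies, yet the closed loop contracts as $\dot{x} = -kx$ regardless of $b$.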

Movie 1: Uncalibrated Touching and Clearing



Movie 2: Uncalibrated Jumping



Levi Burner, Cornelia Fermüller, Yiannis Aloimonos. Embodied Visuomotor Representation.



