Cost-Effective Robot Control Using Deep Learning and a Single RGB Camera

MIT researchers introduce a deep learning approach enabling robots to be controlled using just one RGB camera, potentially broadening design possibilities and reducing costs.
Sept. 2, 2025
3 min read

What You Will Learn

  • The method uses deep learning and a single RGB camera to control a wide range of robots, from industrial arms to soft bio-inspired designs.
  • Training involves unsupervised learning from several hours of video captured from 12 perspectives, enabling the model to understand robot geometry and motion response.
  • Experiments confirmed the system's effectiveness on various robots, opening new possibilities for robotics in unstructured and complex environments.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) developed an approach to robot control that depends on deep learning and images from a single RGB camera—rather than on input from multiple types of sensors or complex hard-coded motion control models.

Researchers say a key advantage of this approach is that it can work with many types of robots—ranging from rigid industrial arms to soft bio-inspired robots, which usually don’t work well with hard-coded approaches to motion control. The approach from MIT also could lead to lower-cost robots, broadening their use across many industrial applications and environments.


As the researchers explain in an article in Nature, “Our method unshackles the hardware design of robots from our ability to model them manually, which in the past has dictated precision manufacturing, costly materials, extensive sensing capabilities and reliance on conventional, rigid building blocks.”  

Training the Neural Network

To get to this point, the researchers used an unsupervised method to train their machine learning model on several hours of video of a robot executing random commands. The video was captured by 12 RGB-D video cameras—model D415 from RealSense (Santa Clara, CA, USA). The researchers recorded images from all 12 perspectives both while each command was being executed and after it finished.
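To make the data-collection protocol concrete, here is a minimal sketch of that loop. The helper names (send_random_command, capture_rgbd) and the episode count are illustrative assumptions, not part of the MIT/CSAIL codebase.

```python
# Hypothetical sketch of the multi-view training-data collection described above.
import numpy as np

NUM_CAMERAS = 12      # RealSense D415 units placed around the robot (per the article)
NUM_EPISODES = 5000   # assumed count; enough episodes to span several hours of video

def send_random_command(num_joints: int) -> np.ndarray:
    """Sample a random actuation command within safe limits (placeholder)."""
    return np.random.uniform(-1.0, 1.0, size=num_joints)

def capture_rgbd(camera_id: int) -> dict:
    """Grab one RGB-D frame from the given camera (placeholder for the real driver call)."""
    return {"rgb": np.zeros((480, 640, 3), dtype=np.uint8),
            "depth": np.zeros((480, 640), dtype=np.float32)}

dataset = []
for episode in range(NUM_EPISODES):
    command = send_random_command(num_joints=7)
    # Frames recorded from all 12 views while the command is being executed...
    during = [capture_rgbd(c) for c in range(NUM_CAMERAS)]
    # ...and again after the robot has finished executing the command.
    after = [capture_rgbd(c) for c in range(NUM_CAMERAS)]
    dataset.append({"command": command, "during": during, "after": after})
```

Because the robot simply executes random commands while the cameras record, no human labeling is required—this is what makes the training unsupervised.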

During the training, the model learned the robot’s geometry and how it responds to control commands.   


After the training was complete, the model needed only one camera to acquire images from a video stream. However, the model does not generalize across robots; it must be retrained for each new robot, although the training process requires no direct human involvement.

The model used in production has two components. The first is a deep learning algorithm that uses a single video stream to encode a robot’s 3D geometry and how any point in that 3D representation will move in response to a specific command.
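The sketch below illustrates one way such a component could be structured: an image encoder feeds a field network that, for any queried 3D point, predicts whether the point belongs to the robot and how it would move under a given command. The architecture, layer sizes, and the Jacobian-style motion output are assumptions for illustration, not the published model.

```python
# Minimal PyTorch sketch of the first component (illustrative, not the MIT model).
import torch
import torch.nn as nn

class PointMotionField(nn.Module):
    def __init__(self, feat_dim=128, cmd_dim=7):
        super().__init__()
        # Image encoder: one RGB frame from the single camera -> feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Field MLP: (image feature, 3D query point) -> occupancy + motion sensitivity.
        self.field = nn.Sequential(
            nn.Linear(feat_dim + 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1 + 3 * cmd_dim),  # 1 occupancy logit + 3 x cmd_dim sensitivities
        )
        self.cmd_dim = cmd_dim

    def forward(self, image, points, command):
        # image: (B, 3, H, W); points: (B, N, 3); command: (B, cmd_dim)
        feat = self.encoder(image)                               # (B, feat_dim)
        feat = feat.unsqueeze(1).expand(-1, points.shape[1], -1) # (B, N, feat_dim)
        out = self.field(torch.cat([feat, points], dim=-1))      # (B, N, 1 + 3*cmd_dim)
        occupancy = out[..., :1]                                 # does this point lie on the robot?
        sensitivity = out[..., 1:].reshape(*points.shape[:2], 3, self.cmd_dim)
        # Predicted 3D displacement of each queried point under the command.
        motion = torch.einsum("bnij,bj->bni", sensitivity, command)
        return occupancy, motion
```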

The second algorithm uses that information to determine the correct motion command for the robot. In the article, researchers describe this model as "an inverse dynamics controller that parameterizes desired motions densely in the 2D image space or 3D, and finds robot commands at interactive speeds."
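One simple way to realize such an inverse controller is to search, by gradient descent through the learned model, for the command whose predicted point motions best match the desired motions. This optimizer choice and the function below are a hedged sketch building on the hypothetical PointMotionField above, not necessarily how the paper solves it.

```python
# Illustrative inverse-controller sketch: find a command that produces a desired motion.
import torch

def solve_command(model, image, points, desired_motion, steps=50, lr=0.1):
    """Search for a command whose predicted point motions match desired_motion.

    model: a trained PointMotionField (see sketch above)
    image: (1, 3, H, W) current frame from the single camera
    points: (1, N, 3) points whose motion we want to control
    desired_motion: (1, N, 3) target 3D displacement for each point
    """
    command = torch.zeros(1, model.cmd_dim, requires_grad=True)
    optimizer = torch.optim.Adam([command], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        _, predicted_motion = model(image, points, command)
        loss = torch.nn.functional.mse_loss(predicted_motion, desired_motion)
        loss.backward()
        optimizer.step()
    return command.detach()
```

Because only a small command vector is optimized, a loop like this can run in a fraction of a second, which is consistent with the "interactive speeds" the researchers describe.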

Computer Vision-Based Robot Motion Control Confirmed in Experiments

In a series of experiments, the researchers demonstrated the effectiveness of their model on numerous robot systems: a 3D-printed robot arm, a hybrid soft pneumatic hand capable of pinching and grasping, a rigid Allegro Hand from Wonik Robotics (Seongnam-si, Republic of Korea) with 16 degrees of freedom, and a low-cost open-source educational robot.


The approach could have enormous implications for expanding the range of activities and environments open to robotics, the researchers explain. “This opens the door to robust, adaptive behavior in unstructured environments, from drones navigating indoors or underground without maps to mobile manipulators working in cluttered homes or warehouses, and even legged robots traversing uneven terrain,” says co-author Daniela Rus, MIT professor of electrical engineering and computer science and director of CSAIL.

About the Author

Linda Wilson

Editor in Chief

Linda Wilson joined the team at Vision Systems Design in 2022. She has more than 25 years of experience in B2B publishing and has written for numerous publications, including Modern Healthcare, InformationWeek, Computerworld, Health Data Management, and many others. Before joining VSD, she was the senior editor at Medical Laboratory Observer, a sister publication to VSD.         
