Vision Language Models (VLMs) Explained: Augmenting Machine Vision and Robotics in Manufacturing
Key Highlights
- VLMs process both visual and textual data to understand complex scenes and make informed decisions, mimicking human reasoning.
- VLMs serve as tools to enhance human operators' skills, providing real-time guidance and error correction to improve quality and efficiency.
- Training VLMs requires large, environment-specific datasets and high computational power.
Robots that can actually think and make decisions? Inspection systems that not only know what’s wrong, but why?
These are but a couple of the applications possible with vision language model (VLM) technology, says Dijam Panigrahi, co-founder and COO of GridRaster (Mountain View, CA, USA), a company that specializes in spatial AI and extended reality. VLMs, a next-generation AI technology that combines visual data and natural language to analyze and understand visual scenes, reason, and make decisions, are starting to gain traction in manufacturing technologies such as machine vision and robotics. In fact, VLMs are already operating successfully on factory floors, enabling robots to go beyond programmed repetitive tasks: a robot can look at a complex manufactured component, reason about what it sees against learned expert behavior and documented standards, and make quality decisions autonomously.
Nonetheless, while the technology shows great promise, some manufacturers are initially reluctant to adopt it, Panigrahi says.
“When anything new comes along, I think we as humans are just resistant to change,” Panigrahi says. “I see this (VLM) as a tool, basically. People who are able to utilize this tool will be more effective, and in the long run, I think they would be the most desired workers or operators employers will seek.”
What Is a VLM and How Does It Work?
A VLM is a type of artificial intelligence model that processes both visual data and natural language, learning a shared representation of a given scene that allows the system to reason across both modalities at once, Panigrahi says.
Because the real world is multimodal, VLMs are multimodal by design—combining vision, text, and contextual cues the way humans do to understand environments and act within them.
“VLMs are your whole understanding with the eyes, right?” Panigrahi says.
A VLM is trained so that images and the text describing the same scene are aligned, enabling the model to answer questions about images, follow visual instructions, generate descriptions, and perform reasoning grounded in visual evidence, Panigrahi says. Data sources for training VLMs can include real sensor data, such as camera output; synthetic data from digital twins; and edge-case and failure scenarios.
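The alignment Panigrahi describes is commonly learned with a contrastive objective: matched image-text pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart. The following is a minimal PyTorch sketch of that idea, with random tensors standing in for the features a real vision and text encoder would produce (the dimensions and temperature value are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for real encoders: in practice these features would come
# from a vision backbone (e.g., a ViT) and a text backbone (e.g., a
# transformer), not from random tensors.
batch, img_dim, txt_dim, shared_dim = 8, 768, 512, 256
image_features = torch.randn(batch, img_dim)  # one weld photo per row
text_features = torch.randn(batch, txt_dim)   # its matching description

# Small projection heads map both modalities into one shared space.
image_proj = nn.Linear(img_dim, shared_dim)
text_proj = nn.Linear(txt_dim, shared_dim)

img = F.normalize(image_proj(image_features), dim=-1)
txt = F.normalize(text_proj(text_features), dim=-1)

# Cosine-similarity matrix: entry (i, j) scores image i against text j.
temperature = 0.07
logits = img @ txt.t() / temperature

# The matched pair sits on the diagonal, so the "correct class" for
# image i is text i, and vice versa (a CLIP-style symmetric loss).
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
print(f"contrastive alignment loss: {loss.item():.3f}")
```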
The VLM is not a replacement for legacy machine vision but rather an augmentation of it, Panigrahi says. For example, a legacy automated inspection system may be designed to determine whether a weld is faulty or a drilled hole is the prescribed size for a product. A VLM, by contrast, can analyze that information and explain why the weld might be faulty or the hole the wrong size, because it has been trained on information defined with prompts and context. In other words, the VLM learns how a human describes and interprets what they see.
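In practice, that augmentation can be as simple as handing the legacy system's measurement, the inspection image, and a grounding prompt to a multimodal model. The sketch below is a hypothetical illustration using the OpenAI Python SDK as one example of a vision-capable API; the model name, file name, and prompt wording are all assumptions, not GridRaster's implementation:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Output from the legacy machine vision system: it already knows the
# hole is out of tolerance; the VLM is asked to explain why.
measurement = "drilled hole diameter 6.42 mm (spec: 6.00 mm +/- 0.10 mm)"

with open("hole_closeup.jpg", "rb") as f:  # illustrative file name
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model would do
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"A legacy inspection system flagged this part: "
                     f"{measurement}. Based on the image, suggest the "
                     f"most likely cause (tool wear, wrong bit, "
                     f"misalignment) and a corrective action."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```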
“You feed all the instructions to the VLM, give it the visual cue of the expert doing that job or doing the repair, assembly, installation, deinstallation, whatever the task you’re doing is,” Panigrahi says. “The model now completely understands what it should be.”
VLMs not only add an interpretive layer to a legacy machine vision system but can also help a human employee operate at a higher level of job competence, even if the employee has limited experience in that job, Panigrahi says.
“Now, somebody who is doing, say, assembly work in real time can get active task guidance that says, okay, do this,” he says.
And if the operator does make a mistake, the VLM can flag it and provide instructions on how to correct it and perform the action properly.
“If you have that expert knowledge in the VLM, it provides you a medium that anyone of almost any skill set—maybe they are just one year into it, doesn't matter—they can access that expert knowledge on the go, right? That's the beauty of it; it’s almost like it converts everybody into an expert.”
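Wired into a live camera feed, that expert knowledge becomes a check-and-correct loop: capture a frame at each assembly step, ask the model whether the step matches the documented procedure, and surface corrective guidance when it does not. Below is a hypothetical outline of such a loop; the steps are invented, and `ask_vlm` is a stub standing in for a real multimodal call such as the API sketch above:

```python
import cv2  # OpenCV camera interface (pip install opencv-python)

# Documented work instructions given to the VLM as context; these
# steps are illustrative, not from an actual deployment.
STEPS = [
    "Seat the gasket flush in the housing groove.",
    "Torque the four M6 bolts in a cross pattern to 9 Nm.",
    "Connect the sensor harness until the latch clicks.",
]

def ask_vlm(frame, step: str) -> str:
    # Placeholder: a real system would send the frame plus a prompt
    # ("Was this step performed correctly? If not, explain the fix.")
    # to a multimodal model, as in the API sketch above, and return
    # either "OK" or corrective guidance for the operator.
    return "OK"

camera = cv2.VideoCapture(0)  # illustrative: first attached camera
for step in STEPS:
    ok, frame = camera.read()
    if not ok:
        break
    verdict = ask_vlm(frame, step)
    if verdict != "OK":
        # Surface the model's correction before the line advances.
        print(f"Step flagged: {step}\nGuidance: {verdict}")
camera.release()
```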
Addressing VLM Challenges
While VLMs are becoming more scalable as technology advances and costs decrease, there are still challenges to address, Panigrahi says. For one, VLMs require huge amounts of data, and that data often needs to be specific to the given working environment. For example, GridRaster does work for entities such as the U.S. Air Force and the U.S. Department of Defense.
“A lot of these environments are very unique—it’s not like the VLMs that are available have been trained in those kinds of environments,” Panigrahi says. “So, the challenge is how do you train for that specific domain?”
Because real-world data, especially in those environments, can be limited, synthetic data generated from real-world data via a digital twin can be utilized to train the VLM, he says.
Hand in hand with the challenge of acquiring the data to train the VLM is the challenge of providing enough computing power to handle that data quickly and accurately, without generating false positives or introducing latency.
VLMs need substantial computing power to run and deliver accurate results. “How much (power) does one model consume?” Panigrahi asks. “And how do you do that in an industrial setting? How do you put that kind of computer power in those settings?”
Fortunately, computer technology is advancing, he says, noting examples such as NVIDIA’s DGX Spark, a desktop-sized computer designed for building, tweaking, and operating large AI models locally and then easily moving them to a data center or the cloud.
About the Author
Jim Tatum
Senior Editor
VSD Senior Editor Jim Tatum has more than 25 years of experience in print and digital journalism, covering business, industry, and economic development issues; regional and local government and regulatory issues; and more. In 2019, he transitioned from newspapers to business media full time, joining VSD in 2023.

