Harnessing VLMs for Real-Time Factory Decision-Making

This episode features Dijam Panigrahi, COO/Co-Founder of GridRaster, discussing how Vision Language Models (VLMs) integrate visual data and natural language to enhance reasoning and decision-making in industrial environments, covering training, deployment, and synthetic data use.

Key Highlights

  • VLMs integrate visual and language data to enable advanced reasoning and decision-making in industrial settings.
  • Edge deployment of VLMs allows real-time processing directly on factory floors, reducing latency and increasing efficiency.
  • Compute requirements for VLMs are significant, but advancements are making deployment more feasible in industrial environments.

In this episode, Jim Tatum interviews Dijam Panigrahi, COO of GridRaster Inc., about Vision Language Models (VLMs), a next-generation AI that blends visual data and natural language to enable reasoning, interpretation, and real-time decision-making in industrial settings.

The discussion covers topics such as domain-specific training, compute requirements, edge deployment, and the use of synthetic data to scale VLMs for real-world factory floors.

Visions: A Machine Vision and Automation Solutions Podcast, is the podcast for engineers, designers, integrators, and end users who want to keep an informed eye on the imaging and machine vision industry. Every second and fourth Tuesday we will explore the latest in imaging trends, developments and solutions. Here you will find interesting, useful insights and observations from expert interviews, solo episodes, even the occasional panel discussion, all of which aim to expand your knowledge on imaging and machine vision. 

Related: Vision Language Models Explained

Transcript

Well, hello and welcome to Visions, a Machine Vision and Automation Solutions podcast. I'm your host, Jim Tatum, senior editor of Vision Systems Design, and Visions is an Endeavor Business Media production from your friends at Vision Systems Design. Here you'll find the latest on everything from end user machine vision solutions to trends, developments, and perspectives on all things machine vision and imaging. Whether you've been working in the industry for a while or you're just starting to take a closer look at it, this podcast is designed to grow your knowledge and bring greater focus to your understanding of the imaging and machine vision industry. And now on to our show.
Robots that can really think and make decisions rather than just follow orders. Inspection systems that not only know when something is wrong, but why it's wrong. Well, yeah, really, this is happening and has been for about a decade now. Hi everybody, and welcome back to Visions. I'm Jim Tatum, and this is part one of a two-part podcast that will look into a very interesting and exciting technology: vision language models, a next-generation AI technology. VLMs are designed to augment traditional legacy vision systems by providing interpretive insights and active task guidance, transforming factory automation, and empowering workers with real-time expert knowledge. By combining visual data and natural language to analyze and understand visual scenes, reason, and make decisions, the technology is starting to gain traction in a number of areas, including machine vision and robotics. With the technology becoming more scalable these days, VLMs are starting to go beyond the lab and into the real world. In fact, VLMs are successfully operating in manufacturing settings such as factory floors, doing things like enabling robots to go beyond simple, programmed, repetitive tasks and instead actually look at a complex manufactured component, reason about what they see against learned expert behavior and documented standards, and make quality decisions autonomously. Intrigued? So were we. So we reached out to Dijam Panigrahi, co-founder and COO of GridRaster, a Mountain View, California-based company that specializes in spatial AI and extended reality.
When people hear the term vision language model, they tend to assume it's just another form of machine vision or something. But in plain terms, what is a VLM, really? And why does that distinction matter, say, inside a factory or on a manufacturing floor?
Yeah, I think all of these, VLMs and LLMs, have come into the picture based on what we have seen is possible with generative AI, right? Initially it was large language models, where basically, think of it as the wisdom of the world in text: we are able to dip into it and get that wisdom, information, and knowledge readily available to us. But that is all in text, right? The real world doesn't operate that way. You have different senses. Text is one way of understanding the world. One of the critical aspects is how you see the world, and that's what the VLMs, in a very short way, give you: the eyes, right? They give you the eyes to look at the world and basically make these generative AI LLM models really work in the real world, because in the real world we see things and do things which you cannot do just by reading. You've got to see things, and based on that you have to interpret and understand and then act on it. Do the reasoning, all of that. The visual cues play an extremely important role. That's what the VLM basically makes possible.

Okay. So it's basically doing reasoning, interpretation, and decision making? All of that, or some of it, or none of it?

No, it's all of that, right? Actually, the beauty of the VLMs is that they kind of mimic the way humans operate in the real world. Without the VLMs, it's almost like you're blind and you're trying to operate. The learnings that you have, that's the LLM without the eyes, right? And VLMs are your whole understanding with the eyes. And by design, the VLMs are multimodal, because the real world is multimodal. We get cues from audio, we get cues from what we see and how we operate. We read text instructions and manuals.
So all of that is part of it when you're operating in the real world. And the VLM basically enables you to mimic all of that, take it into that environment, and be able to take actions, right? Computer vision initially didn't have all of that reasoning. You can take an image and identify, based on the training that you have done, that okay, this is a table, this is a chair, and all those things. Training has happened, so it would be able to understand it. But let's talk with an example: a robot needs to perform a job on an aircraft component, which is what we are working on in those maintenance depots. Now, in a computer vision scenario, what will happen if I run the computer vision? Yes, I would know, maybe, okay, that's a wing of the aircraft. That's all right. But if you want to really go and act on it, which is what the humans are supposed to do, like go and repair, and navigate around any obstruction that is in the way, computer vision is not going to allow you to do that, right? And that's where the VLMs come in. They not only understand that there is this aircraft and there is this wing. They understand that the wing of the aircraft is what I will be repairing, not a chair or a table. In a depot environment, if some obstruction comes in the way, the model is able to figure out that this is an obstruction, something that was not supposed to be there. And you are able to navigate around it and go and do what the robot needs to do on that aircraft, right?
It is even able to understand the human. Most settings right now are collaborative settings, where you are taking advantage of human intuition together with the robot's effectiveness at doing repetitive stuff or dangerous stuff and all of that. In those scenarios, a VLM model will understand if I give a gesture, right? Okay, stop, like I would signal another human to just stop. In that noisy environment you may not be able to hear, but you can get the gesture. The model is able to get the gesture. Pure computer vision will not, right? So this reasoning that we bring in, along with the vision and all the other cues that are available, which is multimodal in nature, gives the VLM the power to do certain things which were not possible earlier.
Okay. Does this change the number of live employees, or does it just change their function?
The way I see it, the jobs will get redefined, right? Let's assume, like currently, we have jobs where there is a person because there are certain things which are running and somebody just has to keep pressing a switch, right? You see that this has gone there, you press the switch, it goes to the next one; press the switch, it goes to the next one. Now, there are many jobs like that, where, you know, you take a caliper and go and measure the damage that is there, right? The VLM will be able to automate all of these things very easily. In that case, what happens? At the end of the day, if there is damage, the operator is supposed to repair it. But part of that repair process is understanding where the damage is, what its dimensions are, and what type of defect it is, based on which he or she will work out the right kind of repair. Now a lot of this can be completely automated just using the VLM models. You can basically automate it: I can see the damage, here are the dimensions, here it is in x, y, z coordinates, the spacing is like this, so the classification is this, and we should repair it with this type of repair. That whole process would otherwise have taken maybe half an hour or sometimes even more, because you are waiting for an expert to make a judgment about what kind of defect it is. Now those results are already out there for you, right? And the operator purely focuses on doing the repair work rather than on all these things which can easily be taken care of by the VLMs. Okay, so that's one. The second is the instructions, right?
Suppose I want to repair some complex machine. More and more, even the industrial systems are becoming more and more electronic, similar to what we are seeing in cars, right? They are increasingly complex, with multiple systems operating together. And sometimes it's extremely difficult for somebody to do it all by themselves. It's just extremely taxing. But these VLM models can understand what needs to be done. They will see an expert doing this thing. You feed all the instructions to the VLM model and give it the visual cue of the expert doing that job, whether it's repair, assembly, installation, reinstallation, whatever you talk about, and the model now completely understands what it should be. Now somebody who is doing that assembly work can get active task guidance in real time: okay, do this. Even if you are not an expert, you can now operate at the level of an expert. And if by mistake you are not taking the correct steps, it can flag in real time that you missed a step you need to do. So all of this is where the VLM is able to help that person who still has to do the job, right? It allows them to do that job much faster and get it right the first time. The quality and all of those aspects improve.
Okay. So the progression is from, you know, several people on a line manually inspecting an aircraft part.
Yes. 
They're looking for a defect they may or may not see. If they see it, it takes several of them to understand what it is and exactly what needs to be done. The next step, of course, would be a legacy machine vision system that identifies the defect and transmits that information to those same people, who still have to figure out exactly what it is and where it is. And this kind of brings it all together.
Absolutely. You put it all together very well, right? So you kind of don't need all the inspectors, but you still need people who can actually manipulate and see and understand. Yeah. Until the FAA changes the rules, you will still need inspectors, honestly. The systems are really improving a lot to take care of all of those things. There are still elements that you are trying to tackle with the VLM models, because of the way the models are generated. Particularly in industrial settings, it's not that they have enough data on which they have been trained. So you are taking those data sets to train in those specific environments so that all the false positives and false negatives are taken care of, right? It's an improving system. But yeah, ultimately it does wonders in terms of productivity and efficiency, which in the current setting in the U.S. you absolutely have a need for, just because you don't have enough skilled labor to address the demand that is there.
So, a follow-up, just to make sure I'm understanding this. One, how does a VLM learn context as opposed to just visual pattern? And two, are there any limits to what tacit knowledge can be captured through observation?
Yes. So the VLM, as I said, the more it sees an expert operating in a certain environment, you can almost say that the expert knowledge, which many times people call the tribal knowledge, is captured in the VLM. Now, if you have that knowledge captured here, it provides you a medium so that anyone, with whatever skill set, maybe they are just one year into it, two years into it, it doesn't matter, can access that expert knowledge on the go, right? That's the beauty of it. It almost converts everybody into an expert.

What sort of challenges have there been on the way? I mean, there are some downsides to it that we haven't overcome or are working on overcoming.

Yeah. There are still quite a few things that need to be done, specifically for environments where the data is very, very specific. For example, we work with the Department of Defense, the U.S. Air Force, and a lot of those environments are very unique. It's not like the VLM models that are available have been trained on those kinds of environments. And many of these environments are very unique in how they function, right? So one of the challenges is, with the existing VLM models that are there, how do you train for those domain-specific or environment-specific nuances? Only then will they be productive in a real environment, right? So there is a training component involved so that you are making those VLM models work in those unique settings. That's one thing which still has to be done. The second is that VLM models need huge compute to run and give you the results, or the accuracy of results, that you're looking for. These models really consume a lot of compute, right? And how do you do that in an industrial setting?
How do you really put that kind of compute in those kinds of environments? I don't know if you've been tracking, but Nvidia came up with the DGX Spark, which is one of the edge compute elements to support many of these. So some of those advancements that are happening are what we are taking advantage of. Obviously, running it from the cloud versus running it on the edge has its own constraints. So you're running it, and then you're also optimizing the data, because the more relevant data you give those models, the better they're going to be, right? Now, when you go to all these industrial settings, you have only so much data. So how do you create the variation in scenarios? That is something that we do: we generate synthetic data from the real-world data to train the models to understand the scenario, the reasoning, the intentions, and all of that much better, which then allows us to reflect that in the real-world setting much better.
Well, that's a wrap for this episode of Visions, produced by Endeavor Business Media, a division of Endeavor B2B. Thanks very much for tuning in. If you enjoyed today's show, be sure to subscribe to the podcast and share this episode with a colleague who would find it helpful. Until our next episode, you can find us at vision-systems.com or on LinkedIn or Facebook for more insights, updates, and breaking news to keep you in the know. Thanks for tuning in. Until next time, stay focused on your visions.

About the Author

Jim Tatum

Senior Editor

VSD Senior Editor Jim Tatum has more than 25 years of experience in print and digital journalism, covering business/industry/economic development issues, regional and local government/regulatory issues, and more. In 2019, he transitioned from newspapers to business media full time, joining VSD in 2023.
