AI Offers Advantages for Control-Oriented Vision Systems for Robotics

Dec. 9, 2022
End-to-end-trained AI-based control offers higher reliability and precision than the traditional measure-and-execute paradigm.

By Ronnie Vuine, Micropsi Industries (Berlin, Germany; www.micropsi-industries.com)

Industrial robots excel at executing movements with submillimeter precision, notably when they have a single, well-defined target in space to navigate to. They perform path planning flawlessly—optimized for speed, distance, wear, and precision. This approach has the additional advantage of being highly accessible for a qualified human operator: A single set of 3D coordinates in a robot base coordinate system is easy for human operators to understand. Tool rotation descriptions, such as Euler angles or quaternions, are less intuitive, but, with experience, still reasonably easy for operators to read and check for plausibility.

While precise execution is a hallmark of modern robotics, precise measurement, unfortunately, is not. Camera-based measurements often return the wrong coordinates because slight variations in lighting, shape, and color throw the algorithms off. Measuring space in three dimensions is hard even under perfectly controlled lighting, and a speck of dust on the lens can flip the single pixel that was needed to make an accurate measurement.

Perfectly executing a path to the wrong destination is still a bad result. Therefore, many robotics engineers have developed a love-hate relationship with cameras. When cameras perform reliably, the resulting robotics applications are impressive. But, getting cameras to perform reliably is difficult and sometimes impossible. So, to mitigate project risk and keep solutions simple, engineers often avoid using cameras altogether.

Relative Movements in Real Time

Unlike robots, humans are fallible both in executing movements and in measuring. And yet, we manage to solve coordinated-movement problems that are unfathomable to robots. The difference is that robots are approaching the problem the wrong way.

Humans don't measure, and they don't move to coordinates. They make relative, real-time controlled, rough movements toward the target, and they frequently correct. Instead of solving the extremely hard engineering problems of measuring to perfection once and then executing to perfection once, they solve a much simpler problem much more often—moving roughly toward the goal.
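
The sketch below illustrates that strategy in Python. The camera, robot, and decision function (grab_frame, policy, move_relative) are hypothetical placeholders rather than any specific vendor API; the point is only that, instead of one perfect measurement followed by one perfect move, a loop makes a small, rough correction every cycle.

    import time
    import numpy as np

    CYCLE_S = 0.05        # one correction roughly every 50 ms
    STEP_LIMIT = 0.002    # clamp each relative move to 2 mm

    def servo_to_goal(robot, camera, policy, max_steps=400):
        """Nudge the tool toward the goal repeatedly instead of measuring
        once and executing one supposedly perfect move."""
        for _ in range(max_steps):
            image = camera.grab_frame()
            # The policy maps the current view to a small relative motion
            # (dx, dy, dz in meters); it never outputs absolute coordinates.
            delta = np.clip(policy(image), -STEP_LIMIT, STEP_LIMIT)
            if np.linalg.norm(delta) < 1e-4:   # close enough; stop correcting
                return True
            robot.move_relative(delta)
            time.sleep(CYCLE_S)
        return False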

A similar strategy has been applied to robots only recently because "move roughly toward the goal" was not a directive a robot could understand—until the arrival of deep neural networks. Now, it's possible to train a neural network to understand the goal without having to specify it explicitly. And, it's possible to run the neural network fast enough to mimic the human ability to visually servo toward the goal, making many small, quick corrections during the approach. Using a camera to guide the robot in this way also brings an added benefit: Manual feature engineering is no longer necessary.

Feature Engineering

Feature engineering involves carefully applying human help to computer vision algorithms. An engineer thinks about features, such as edges or characteristic points, that the algorithms should pay attention to and then configures operations on the raw data—or even the physical world—to make the features easy to find. This requires having a lot of experience, a deep understanding of the algorithms, and knowledge of the tricks of the trade. The process is complicated and time-consuming.

Naive pattern matching is the most basic approach to measuring a 2D position. In this method, the camera is calibrated so that every image pixel corresponds to a known physical location. A position is then determined by calculating the difference between a predefined pattern and every overlapping pattern-size patch of the camera image. These algorithms essentially count pixels to find the part of the image most similar to what is being looked for.
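
For readers who want to see what this looks like in code, the snippet below uses OpenCV's template matching to score every patch of an image against a reference pattern. The image file names are placeholders for illustration.

    import cv2

    # Naive pattern matching: slide the template over the image and score
    # the pixel-level difference at every position.
    image = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
    template = cv2.imread("part_template.png", cv2.IMREAD_GRAYSCALE)

    # Normalized squared difference: 0 means a perfect pixel-for-pixel match.
    scores = cv2.matchTemplate(image, template, cv2.TM_SQDIFF_NORMED)
    min_val, _, min_loc, _ = cv2.minMaxLoc(scores)

    h, w = template.shape
    center = (min_loc[0] + w // 2, min_loc[1] + h // 2)
    print(f"best match at pixel {center}, score {min_val:.3f}")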

More sophisticated algorithms—such as BRIEF, SIFT, or SURF—use filters to emphasize features or key points of an image and then perform pattern matching in feature space instead of on the raw pixels. This approach enables the algorithm to determine positions independently of scale and rotation, and the distance calculations between two potential matches can also be made a lot faster than they can in pixel space. For 3D problems, feature detection requires depth images, typically point clouds, created by dual-camera stereo systems or time-of-flight-based cameras that emit infrared light and measure the time it takes the reflection to arrive back at the camera.
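
A minimal sketch of matching in feature space follows, using OpenCV's ORB detector (a BRIEF-based detector that ships with the library) as a stand-in for SIFT or SURF; again, the file names are placeholders.

    import cv2

    image = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
    template = cv2.imread("part_template.png", cv2.IMREAD_GRAYSCALE)

    # Detect keypoints and compute binary descriptors for both images.
    orb = cv2.ORB_create(nfeatures=500)
    kp_t, des_t = orb.detectAndCompute(template, None)
    kp_i, des_i = orb.detectAndCompute(image, None)

    # Matching happens in descriptor space: Hamming distance on binary
    # descriptors is far cheaper than pixel-space comparison and tolerates
    # moderate rotation and scale changes.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_t, des_i), key=lambda m: m.distance)

    for m in matches[:5]:
        x, y = kp_i[m.trainIdx].pt
        print(f"template keypoint {m.queryIdx} -> image point ({x:.1f}, {y:.1f})")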

Even in 2D, most of these techniques work best with some form of structured light projected onto the scene—from specific colors to make certain features pop, to grids to extract surface information. Small changes in pixel brightness or color easily throw off these algorithms. It is common to deliberately reduce image complexity by ignoring color (and all the useful information it may contain) or even by projecting light in such a way that a very robust bright reflective spot becomes a reference point for a feature description algorithm to latch on to. Sometimes this is performed in a non-visible part of the spectrum to avoid interference from everyday lighting changes. But if sunlight enters the scene, results will be skewed.

The fundamental problem is that tricks are being used to mask off information that would distract brittle algorithms. A proper way of reducing the image complexity would be to have algorithms that look at a scene as humans do. Humans consider all the information that reaches the eye—colors, shapes, brightness, reflections, and refractions—and they just know which information to ignore when it is not relevant to their goal.

End-to-End Deep Learning

To know what to ignore, an algorithm needs to know what the visual information will ultimately be used for. Traditionally, the measuring part of the problem has been kept separate from the execution part, so the vision algorithm never gets that context. That separation is one reason computer vision used to be so hard.

When a deep neural network processing an image also has access to information about what to do, that is, how the robot needs to move, it can learn what to ignore. Conversely, it can also infer the right action directly from the image. That is why combining end-to-end learning with real-time control is such a powerful idea.
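
To make the idea concrete, the sketch below shows a generic visuomotor network that maps a raw camera image directly to a small six-degree-of-freedom relative motion. It is an illustrative PyTorch architecture only, not the network used in any particular product.

    import torch
    import torch.nn as nn

    class VisuomotorPolicy(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(          # learns which pixels matter
                nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.head = nn.Linear(64, 6)           # dx, dy, dz, drx, dry, drz

        def forward(self, image):
            return self.head(self.encoder(image))

    policy = VisuomotorPolicy()
    frame = torch.rand(1, 3, 120, 160)             # one 160x120 RGB frame
    delta = policy(frame)                          # relative motion command
    print(delta.shape)                             # torch.Size([1, 6])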

Newer visual control systems are expanding the possibilities in robotics. Training a robot to pick up a cable and plug it into a slot in a test application requires using one of these new AI robot systems along with a hardware setup that includes a table or an automated guided vehicle (AGV) on which to place the robot, a gripper, and safety equipment. And, users will need to train the robot’s deep neural networks. This process is simpler than it sounds, thanks to products that package the required math.

To perform the training, end users position the robot's tool at the point in space where the tool is supposed to go, relative to the object. In the example above, that means placing the tool at the plug on the end of a cable dangling in the air. A simple 2D color camera mounted on the robot's wrist streams images to the AI controller while the user demonstrates where to go, so the camera captures the surroundings of the target. Users then move the cable to create another scenario of how the problem may look, place the tool in the right spot again, save the position, and let the camera take in the new scene. Repeating this process for about 20 minutes collects enough data for the neural networks to learn.

The AI controller then spins up its training machinery and crunches the numbers, calculating a skill for the robot that generalizes over all the examples shown to it. After a while, this skill, a neural network trained on the recorded user data, is ready. When executed, the network reads an image from the camera stream (typically every 50 ms) and decides how to move the robot: not where to move it in precise coordinates, just how to move it closer to where it is supposed to be.
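
Under the hood, turning such demonstrations into a skill can be thought of as behavior cloning: each recorded frame is paired with the relative motion the user demonstrated, and a network is trained to regress that motion. The sketch below illustrates this with random stand-in data and a deliberately simple network; the dataset format, architecture, and hyperparameters are assumptions for illustration, not any product's actual training pipeline.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in demonstration data: recorded camera frames, each paired with
    # the small relative motion (dx, dy, dz, drx, dry, drz) shown by the user.
    frames = torch.rand(2000, 3, 120, 160)
    deltas = torch.rand(2000, 6) * 0.01

    loader = DataLoader(TensorDataset(frames, deltas), batch_size=64, shuffle=True)

    # Deliberately simple stand-in for the policy network.
    policy = nn.Sequential(
        nn.Flatten(),
        nn.Linear(3 * 120 * 160, 256), nn.ReLU(),
        nn.Linear(256, 6),
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

    for epoch in range(5):
        for batch_frames, batch_deltas in loader:
            loss = nn.functional.mse_loss(policy(batch_frames), batch_deltas)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # At run time, the trained policy is queried with a fresh frame roughly
    # every 50 ms and returns only a small corrective motion, never a coordinate.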

From the examples they have been given, the neural networks learn which information to ignore: the details that changed from example to example, such as lighting, background, reflections, and the precise shape of the cable. They also learn which information is relevant to the problem. For picking a plug, that may be its color, its shape, and the best angle from which to approach it. Given enough consistent data, the robot will be able to pick the dangling plug.

When users first encounter an AI system, many expect that, because it is intelligent, the system can simply guess what it is supposed to do. This is not the case. The intelligence is in its ability to learn anything it is taught, but it still requires good teachers. Successful AI comprises both good algorithms and good data. For some AI robot controllers, the algorithms are included, but the data must come from users, and it must make sense. Showing the robot confusing examples, such as conflicting movements demonstrated for scenes that look the same, will result in a confused robot. Show the robot consistent data, in which the target is visible and similar images are paired with similar demonstrated movements, and the robot's behavior will be significantly more capable and robust than anything implemented with the old measure-and-execute paradigm.

Using AI Controllers

Not all robot movements need to be controlled by cameras in real time. In fact, most movements do not. Camera control is typically needed only at the ends of movements—for instance, when finding an object whose rough location is known, when picking or inspecting an object, or when inserting or assembling objects. The movements in between, from one point in free space to another, are fairly simple and can be position-controlled and path-planned by the robot's own control software. So, AI controllers for industrial robots leave that part alone. The AI controllers also never directly control the joints of the robot; they just tell the existing control stack what they think should happen. The robot's own controller then decides whether the movement is safe and whether it needs smoothing out. As a result, AI controllers do not replace robot controllers, and they shouldn't. They are add-ons, enhancing a robot's capabilities.
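
A sketch of that division of labor follows, with all robot and skill interfaces as hypothetical placeholders rather than a real controller API.

    # The robot's own controller handles the long position-controlled moves;
    # the AI controller only suggests relative corrections near the end of a move.
    def run_pick_cycle(robot, camera, skill):
        # Free-space motion: planned and executed by the robot controller alone.
        robot.move_to_joint_target("above_pick_area")

        # Final approach: the AI skill proposes small relative motions, while
        # the robot controller stays in charge of safety checks and smoothing.
        while not skill.converged():
            suggestion = skill.suggest_motion(camera.grab_frame())
            robot.apply_relative_motion_if_safe(suggestion)

        robot.close_gripper()
        robot.move_to_joint_target("place_position")   # planned move again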

Industrial robots work in environments also used by humans, who depend largely on the visible portion of the electromagnetic spectrum for information intake. Robots work in factories designed to be observable to human eyes, and the robots perform work that needs to be seen by humans. Everything in a robot’s environment is designed to be visually inspectable, and most available information on the state of a factory is encoded in rays of light.

If there were a way to extract that information from simple camera images (designed for the human eye) and to decide what is relevant and what is not, cameras would be the ideal sensor. Deep learning now makes that extraction possible, and end-to-end learning provides the way to tell the relevant and the irrelevant apart.
