By Jarek Bogalecki
Deep neural networks continue making headway into industrial vision systems, as more businesses than ever consider how to use deep learning to reduce expenditures and improve output quality. While the concept is exciting, businesses may not understand the reality behind the marketing narratives promising low-cost, quick, and foolproof AI-based inspection that replaces traditional applications. Deep neural networks require proper data during the training phase as well as regular upkeep to ensure long-term accuracy of results. With proper preparation, this regular upkeep does not take up significant amounts of time. Correct long-term maintenance of training data still calls for some knowledge and expertise, however.
Creating training datasets
The basis of a deep learning-based inspection application is its dataset, a collection of images used for training. Through this dataset, a neural network learns exactly what it should look for: shapes, spots, and colors specified in advance (as in feature detection or optical character recognition (OCR) networks), or simply anything that does not fit a prescribed template (as in anomaly detection). The trained neural network is called a model.
Someone new to deep learning often asks, “What kind of images do we need?” The answer depends on the application. Some neural networks, such as those designed to recognize printed characters, usually train on a variety of samples from multiple sources to ensure maximum flexibility. Even though the system must detect only a relatively limited pool of characters, namely the letters A to Z, the numbers 0 to 9, and a handful of additional marks, these characters will appear in a multitude of typefaces and sizes (Figure 1).
On the other hand, a deep learning model tasked with inspecting features on the surface of a metal pipe will need to train with images taken from that specific type of metal pipe and with all the different features that might appear on that specific type of pipe. Since deep learning models are usually trained to function against a well-defined object, it is better to focus on that object rather than strive for unachievable universality that makes the model unreliable.
The images then go through the annotation process, which adds visual cues teaching the model that an image, or a selected part of one, belongs to a given category. A bounding box, a rectangle drawn around a selected part of an image, is the simplest form of annotation. The section of the image inside the bounding box is labeled to identify it as something the deep learning model looks for during training. A multitude of images, each with a box drawn around an apple that is then labeled “apple,” teaches a deep learning model to detect apples in new images.
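The bounding-box-plus-label idea can be sketched as a simple data structure. This is an illustrative assumption, not any particular annotation tool's format; the class and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    # Pixel coordinates of the rectangle's top-left corner, plus its size.
    x: int
    y: int
    width: int
    height: int

@dataclass
class Annotation:
    box: BoundingBox    # where in the image the object sits
    label: str          # what the pixels inside the box represent

# One training image may carry several annotations, e.g. two apples.
image_annotations = {
    "image_0001.png": [
        Annotation(BoundingBox(x=120, y=80, width=64, height=64), label="apple"),
        Annotation(BoundingBox(x=300, y=150, width=58, height=60), label="apple"),
    ],
}

labels = [a.label for a in image_annotations["image_0001.png"]]
```

A model trained on many such records learns to associate the pixels inside each box with the label attached to it.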
Metadata assigned to either an entire image (e.g. an “OK/NG” sort of label to classify images as either acceptable or deviating from the original template without the need to draw graphical masks on them ahead of training) or just a section of an image provides additional information. Datasets meant to train deep learning models for OCR inspections, for example, will typically have their annotators enclose each character within a bounding box and subsequently attach a piece of metadata with the value indicating what the pixels underneath the mask are supposed to mark. Aside from bounding boxes, other marking tools include polygonal shapes, paths, or brushes (Figure 2). The latter would be particularly useful for feature detection, among other applications.
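An OCR-style annotation record of the kind described above might look like the following sketch. The JSON layout and key names are hypothetical, invented for illustration; real annotation services each define their own schema:

```python
import json

# One OCR training image: each character is enclosed in a bounding box
# ([x, y, width, height] in pixels) and carries a metadata "value" naming
# the character the pixels underneath the mask are supposed to mark.
record = {
    "image": "label_plate_042.png",
    "annotations": [
        {"box": [34, 12, 18, 26], "value": "A"},
        {"box": [56, 12, 17, 26], "value": "7"},
    ],
}

serialized = json.dumps(record)  # annotations are typically stored as text
decoded_text = "".join(a["value"] for a in record["annotations"])
```

Reading the per-character metadata back in order reconstructs the string the image depicts, which is exactly what the trained model must eventually output.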
After clarifying the required type of samples and the method of annotation needed for model training, the user determines the amount of data required. This also depends on the specific application. When training a model to detect specific features or objects that may appear to the camera lens at varied angles, datasets will be on the larger end of the spectrum to provide the model with numerous examples of what to look for. On the other hand, when trying to pinpoint any deviations from the established norm, it will suffice to feed the model with images depicting acceptable templates.
Striking the right balance between too few and too many samples is crucial. Having a limited dataset at our disposal—something common in an industrial environment, particularly with models that require training on defective samples not readily available at a manufacturing site—presents an obvious challenge.
Consider a model tasked with differentiating between a dark scratch on the surface of a plastic bottle, which would make it unsuitable for sale, and a simple hair resting on the plastic. The annotation process would consist of using two brush-type tools, one assigned the metadata “Scratch” and the other “Hair,” and then drawing masks to encompass a sufficiently precise contour of scratches and hairs (Figure 3). However, even well-defined features will not yield a reliable model if it is fed only one or two pictures of each category.
Too large a batch of training images, on the other hand, risks overfitting a model. Instead of relying on subtle clues, an overfitted model will look for features corresponding exactly to those indicated by the annotation masks in the dataset, resulting in inflexibility and inefficiency. One method of determining the optimal dataset size is to start with several dozen pictures and then track how model accuracy evolves as images are added or removed.
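The accuracy-versus-dataset-size sweep can be sketched with a toy stand-in for the model. The nearest-centroid classifier and the synthetic one-dimensional samples below are illustrative assumptions, nothing like a real inspection network, but the sweep procedure is the same: train on progressively larger subsets and evaluate each on a fixed held-out set.

```python
import random

random.seed(0)

def make_samples(n):
    # Two classes of synthetic 1-D "images": class 0 clusters near 0.0,
    # class 1 near 1.0, each blurred with Gaussian noise.
    return [(random.gauss(cls, 0.35), cls) for _ in range(n) for cls in (0, 1)]

def train(samples):
    # "Training" here is simply computing one centroid per class.
    return {
        cls: (sum(x for x, c in samples if c == cls)
              / sum(1 for _, c in samples if c == cls))
        for cls in (0, 1)
    }

def accuracy(centroids, samples):
    # A sample is classified by the nearest centroid.
    correct = sum(
        1
        for x, c in samples
        if min(centroids, key=lambda k: abs(x - centroids[k])) == c
    )
    return correct / len(samples)

test_set = make_samples(200)  # fixed held-out evaluation set
for size in (5, 25, 100):     # grow the training set step by step
    acc = accuracy(train(make_samples(size)), test_set)
    print(f"training samples: {size * 2:4d}  accuracy: {acc:.2f}")
```

Plotting (or simply reading off) accuracy at each step shows where adding more images stops paying off, which is the point the article suggests looking for.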
Fresh training images retain model accuracy
Even an optimally trained deep learning model will not stay reliable forever. “Model drift,” defined as decay in a deep learning model’s accuracy over time, has several common causes, like changing environmental conditions. Consider a model trained with images of objects resting in direct sunlight. The model may be less accurate when trying to analyze images of objects not resting in direct sunlight and therefore much darker by comparison.
Model drift may occur accidentally, if cameras or illumination reposition, for instance, or if equipment deteriorates over time. It may occur deliberately, if a model trained to detect one type of feature is later used to detect a similar-but-different feature, for example. Whatever the cause, model drift may require retraining with new images that require annotation, re-annotating existing images, or eliminating certain images from the training dataset.
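One common way to notice drift before it becomes costly is to spot-check the model's verdicts against ground truth and watch the agreement rate over a sliding window. The sketch below is a minimal, hypothetical monitor, not part of any specific inspection product:

```python
from collections import deque

class DriftMonitor:
    """Flags possible model drift when spot-check accuracy over a
    sliding window of recent inspections drops below a threshold."""

    def __init__(self, window=100, alert_below=0.9):
        self.results = deque(maxlen=window)  # True = model agreed with ground truth
        self.alert_below = alert_below

    def record(self, correct: bool):
        self.results.append(correct)

    def drifting(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False  # not enough evidence yet
        return sum(self.results) / len(self.results) < self.alert_below

monitor = DriftMonitor(window=10, alert_below=0.9)
for outcome in [True] * 9 + [False]:
    monitor.record(outcome)
ok_at_first = monitor.drifting()   # 9/10 = 0.9, not below threshold

monitor.record(False)              # window slides: now 8/10 correct
drift_detected = monitor.drifting()
```

When the monitor fires, the remedies from the article apply: collect and annotate fresh images, re-annotate, or prune the dataset, then retrain.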
Facility management may call for retraining only when a model ceases to perform well under new circumstances. In industrial applications, however, it is more common to retrain a model on a fairly regular basis, regardless of whether there are any significantly different samples to take into consideration. Other systems are automated to retrain upon receiving a given amount of new data. While possibly more tedious than the ad hoc approach, frequent retraining of a model generally eliminates the need for a comprehensive overhaul of the training dataset, instead introducing modifications to the model gradually over time.
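The "retrain upon receiving a given amount of new data" policy reduces to a simple counter-based trigger. The class below is a hedged sketch with hypothetical names; in practice the retraining step would launch a real training job rather than increment a counter:

```python
class RetrainScheduler:
    """Accumulates newly annotated images and triggers retraining
    whenever a batch-size threshold is reached."""

    def __init__(self, batch_threshold: int):
        self.batch_threshold = batch_threshold
        self.pending = []        # image IDs waiting to enter the dataset
        self.retrain_count = 0   # how many retraining runs have fired

    def add_image(self, image_id: str):
        self.pending.append(image_id)
        if len(self.pending) >= self.batch_threshold:
            self._retrain()

    def _retrain(self):
        # Placeholder: real code would merge the pending batch into the
        # dataset and launch a training job here.
        self.retrain_count += 1
        self.pending.clear()

scheduler = RetrainScheduler(batch_threshold=50)
for i in range(120):
    scheduler.add_image(f"img_{i:04d}.png")
# 120 new images with a threshold of 50 yield two retraining runs,
# with 20 images still pending for the next batch.
```

The gradual cadence this produces is exactly the trade-off the article describes: more frequent but smaller dataset updates instead of rare wholesale overhauls.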
The development of a neural network model’s dataset is never actually finished as long as the lifecycle of the target system continues. While capturing the needed images may not pose a difficulty for a business, the process of annotating the images becomes more laborious as the required number of images increases.
Businesses may therefore elect to outsource this image annotation process to third party services, like Zillin.io, which help coordinate the efforts of dozens or hundreds of annotators all across the world. These services also provide image hosting and storage as well as safe channels for file transfers. This makes it easier to embrace the reality that while neural networks make for powerful tools, they are not fire-and-forget solutions. They require care and attention throughout their lifecycle to maintain their accuracy and reliability and thus continue to be important, time-saving elements in modern machine vision-based inspection systems.
Jarek Bogalecki is a Computer Vision Application Engineer at Adaptive Vision