Training a computer vision model is one component of a complex and iterative undertaking that can often seem daunting. At alwaysAI we want to make the process as simple and approachable as possible. To get you started, we have compiled a general overview of the training process of Deep Neural Networks (DNNs) for use in computer vision applications. We will focus on supervised learning in this overview, which uses labeled training data to teach a model what the desired output is. This article provides an introduction to each component of the model training process.
Types of Models
Different computer vision models help us answer questions about an image. What objects are in the image? Where are those objects in the image? Where are the key points on an object? What pixels belong to each object? We can answer these questions by building different types of DNNs. These DNNs can then be used in applications to solve problems like determining how many cars are in an image, whether a person is sitting or standing, or whether an animal in a picture is a cat or a dog. We’ve outlined a few of the most common types of computer vision models and their use cases below.
NOTE: In general, a computer vision model output consists of a label and confidence or score, which is an estimate of the likelihood of correctly labeling an object. This definition is intentionally vague, as ‘confidence’ will mean very different things for different types of models.
When describing the different types of models and their use cases, we’ll outline an example use case of a virtual wardrobe: an application that lets users try on different clothing items virtually before making a purchase.
Image classification attempts to identify the most significant object classes in an image. In computer vision, we refer to each class as a label. For example, we can use a general classification model, such as ‘alwaysai/googlenet’, to identify items of clothing, such as ‘running shoes’ or a ‘sweatshirt’, as shown below. The model takes an image as input and outputs a label along with the confidence the model has in that label compared to the other labels. However, DNNs for image classification do not provide the location of the object in the image. So, for use cases where we need this information, for example to track or count objects, we need to use an object detection model, which is the next model described.
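To make the "label plus confidence" output concrete, here is a minimal sketch of how raw class scores could be turned into a single label and a confidence value. It assumes the model's final layer produces one score per label and that a softmax is applied to those scores; the labels and score values are made up for illustration and do not come from any particular model.

```python
import math

def classify(scores, labels):
    """Turn raw per-class scores into a (label, confidence) pair via softmax."""
    # Subtract the largest score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index of the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical labels and raw scores, for illustration only.
labels = ["running shoes", "sweatshirt", "jeans"]
raw_scores = [2.0, 0.5, 0.1]

label, confidence = classify(raw_scores, labels)
print(label, round(confidence, 2))
```

Because the softmax probabilities sum to one, the confidence here is relative to the other labels the model knows about, which is why a classifier can report a high confidence even for an object it was never trained on.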
When the location of the object is of importance, object detection DNNs are typically used. These networks return a set of coordinates, called a bounding box, that specifies an area of the input image containing an object, along with a confidence value for that bounding box and a label. For our ‘virtual wardrobe’ application, we would need an input image of the person who wants to try on the virtual clothes, and then we would need to find that person in the image. To do this, we could use an object detection model for person detection, such as ‘alwaysai/mobilenet_ssd’, that would return a bounding box around every person in an image, along with a label, ‘person’, and a confidence value for the output. An example of an object detection model that can distinguish people is shown below.
NOTE: Knowing the location of objects in a frame allows us to infer certain information about an image. For instance, we could count how many vehicles there are on a freeway to map out traffic patterns. We can also extend the functionality of an application by piggy-backing a classification model onto an object detection model. For instance, we could feed the portion of the image corresponding to the bounding box from the detection model into the classification model so we could count how many trucks there are in the image versus sedans.
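The piggy-backing idea in the note above can be sketched in a few lines. This is an illustration only: the detection dictionaries, the `(x1, y1, x2, y2)` box layout, and `toy_classifier` are assumptions standing in for real model outputs, and the image is a plain nested list rather than a real frame.

```python
from collections import Counter

def crop(image, box):
    """Return the part of a row-major image inside box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def count_by_class(image, detections, classify_crop, min_confidence=0.5):
    """Classify each confident detection's crop and count the resulting labels."""
    counts = Counter()
    for det in detections:
        if det["confidence"] < min_confidence:
            continue  # ignore low-confidence boxes
        counts[classify_crop(crop(image, det["box"]))] += 1
    return counts

def toy_classifier(region):
    """Hypothetical stand-in for a classification model: wide crops are trucks."""
    return "truck" if len(region[0]) > 3 else "sedan"

image = [[0] * 10 for _ in range(6)]  # placeholder 10x6 "image"
detections = [
    {"label": "vehicle", "confidence": 0.9, "box": (0, 0, 5, 3)},
    {"label": "vehicle", "confidence": 0.8, "box": (6, 1, 8, 4)},
    {"label": "vehicle", "confidence": 0.3, "box": (2, 2, 9, 5)},
]

counts = count_by_class(image, detections, toy_classifier)
print(dict(counts))
```

In a real application the toy classifier would be replaced by a call to a trained classification model, but the flow stays the same: detect, crop, classify, aggregate.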
Now we’ve seen how we can classify clothing into groups, such as shoes or sweatshirts, and we can detect a person in an image, but we still need to be able to let the user try on the clothes. This requires being able to distinguish the pixels that belong to the detected object from the pixels in the rest of the image. So, we need to use image segmentation, which we’ll cover next.
As we described above, in some tasks it is important to understand the exact shape of the object. This requires generating a pixel-level boundary for each object, which is achieved through image segmentation. DNNs for image segmentation classify each pixel in an image by either object type, in the case of semantic segmentation, or by individual objects, in the case of instance segmentation.
NOTE: Currently, the alwaysAI platform supports semantic segmentation. We are looking to grow the platform and are adding new models, including ones that perform instance segmentation.
For our virtual wardrobe application, we could use semantic segmentation to distinguish the pixels that belong to the person, sweater, or shoe. We can then replace the pixels in one object with those from another object, such as replacing a sweater that the person in the original image was wearing with the pixels from the new sweater that the user wants to try on.
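As a rough sketch of that pixel swap, assume the segmentation model returns a label map with one class index per pixel, the same size as the input image. The class indices, the string "pixels", and the function name below are all illustrative assumptions.

```python
BACKGROUND, PERSON, SWEATER = 0, 1, 2  # hypothetical class indices

def replace_class_pixels(image, label_map, target_label, new_pixels):
    """Copy new_pixels into image wherever label_map equals target_label."""
    return [
        [new if lab == target_label else old
         for old, lab, new in zip(img_row, lab_row, new_row)]
        for img_row, lab_row, new_row in zip(image, label_map, new_pixels)
    ]

# Tiny 3x2 example; strings stand in for pixel colour values.
image = [["bg", "skin", "bg"],
         ["bg", "red", "red"]]
label_map = [[BACKGROUND, PERSON, BACKGROUND],
             [BACKGROUND, SWEATER, SWEATER]]
new_sweater = [["blue"] * 3 for _ in range(2)]

result = replace_class_pixels(image, label_map, SWEATER, new_sweater)
print(result)
```

Only the pixels labeled as the sweater change; the person and the background are left untouched, which is exactly what a bounding box alone could not give us.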
NOTE: a popular use case for semantic segmentation is a virtual background used for teleconferencing software like Zoom or Microsoft Teams: the pixels belonging to a person are distinguished from the rest of the image and are segmented out from the background.
In our examples for image segmentation applications for people, we saw that image segmentation enables us to distinguish which pixels belong to each object in an image. However, image segmentation doesn’t enable us to infer anything about the relative position of objects in the image, such as where a person’s hand is or where the taillights and bumper are on a car.
For that, we would need information about specific areas of a person, for example, where a person's hand is compared to their head. This requires tracking object landmarks, which is the last model type we’ll cover here.
Object Landmark Detection
Object landmark detection is the labeling of certain ‘key points’ in images that capture important features of an object. For our virtual wardrobe, we could use a pose estimation model, such as ‘alwaysai/human-pose’, that identifies body key points such as hips, shoulders, and elbows, similar to the image shown below, to help our users accessorize. We could use eye key points to place glasses or a hat on the person in our virtual wardrobe, or use the ‘neck’ key point to let them try on a scarf.
NOTE: another useful application of key points is checking for proper form during exercises and sports.
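Both uses of key points mentioned above come down to simple geometry on (x, y) coordinates. The sketch below assumes the pose model returns a dictionary of named key points in pixel coordinates; the key point names and values are hypothetical and will differ between models.

```python
import math

def glasses_placement(keypoints):
    """Return the centre point and width to use when drawing glasses."""
    (lx, ly), (rx, ry) = keypoints["left_eye"], keypoints["right_eye"]
    center = ((lx + rx) / 2, (ly + ry) / 2)
    eye_distance = math.hypot(rx - lx, ry - ly)
    return center, eye_distance

def joint_angle(a, b, c):
    """Angle in degrees at point b between the segments b->a and b->c."""
    angle = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    angle = abs(angle) % 360
    return 360 - angle if angle > 180 else angle

# Hypothetical key points in pixel coordinates.
keypoints = {
    "left_eye": (40, 50), "right_eye": (60, 50),
    "shoulder": (0, 1), "elbow": (0, 0), "wrist": (1, 0),
}

center, width = glasses_placement(keypoints)
elbow_angle = joint_angle(keypoints["shoulder"], keypoints["elbow"], keypoints["wrist"])
print(center, width, elbow_angle)
```

An exercise-form checker could compare a joint angle like `elbow_angle` against a target range for each frame, while the virtual wardrobe would scale and position an accessory using the centre point and width.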
Model Applications in General
Computer vision models can be applied to a whole host of applications. You could build a classification model for classifying types of dogs in a dog show or build a detection model to find cancerous cells in biopsy slides. Conservation biologists could use one model to detect the presence of a particular family or genus, and then feed that output into a classification model to determine species - aggregating this data to track conservation efforts.
Semantic segmentation is used for self-driving car technologies. An object detection model could be used to count items and generate inventory at a grocery store. The possibilities for computer vision are extensive!
Computer vision model training begins with assembling a quality dataset. As the adage goes, “garbage in, garbage out”. But what constitutes “garbage” for a computer vision dataset? In computer vision, “inference” is the term we use for applying a trained model to an input to infer an outcome. We like to say “train as you will inference”. So, a good rule of thumb for a quality computer vision dataset is that it is similar to the real-world data that will be input into the trained model. To ensure this, consider the angle of the image, the lighting and weather, whether the desired objects are obscured, how far away the desired targets are, the resolution and scale of the image, and the background and foreground.
Types of Dataset Generation
There are a few different ways to generate a dataset depending on your timeline and desired use cases. If you want a general detection model to fit into a prototype application right away, you may want to try to find an existing dataset complete with annotations. If you instead want a model that does one specific task very well, you’ll probably need to collect your own images that more closely resemble the environment in which the model will be used.
For instance, if you just want a model that will detect birds in general, you could probably go with a pre-existing dataset of images containing birds in different settings. However, if you are the conservation biologist we described earlier and you want a model that will be used to consistently detect birds appearing on the feed from a particular camera you’ve set up, you should collect images from the perspective of that camera to train the model on.
We’ll describe both of these options, as well as a couple of others, below. We’ll also cover data augmentation, which you can use to increase the size of any dataset.
Using an Existing Annotated Dataset
Depending on what you intend your model to detect, there may be public annotated datasets for you to use. They can drastically cut down the time it takes to train and deploy your desired model.
However, you will have much less control over the data quality this way. Since the training data might not closely resemble the data you'll be using as input for the model, the performance of your model may suffer. Therefore, this approach may be best for proof-of-concept projects, and you will likely need to generate your own specific dataset for your unique application.
Some popular existing datasets are Common Objects in Context (COCO), PASCAL Visual Object Classes (VOC), ImageNet, and Google’s Open Images Dataset V6. Some public datasets have non-commercial licenses, so remember to always check the license of any existing dataset you use.
Using Existing Data or Collecting Your Own Data
You can compile your own dataset by recording videos, taking photos, or searching for freely available videos and images online. Unlike pulling data from an existing annotated dataset, you will need to annotate your collected images before you can use them for training. There are many popular sites for gathering photos including Unsplash, Pixabay, and Pexels (the latter two also provide videos). While gathering data, keep in mind the principles for data collection outlined earlier and attempt to keep your dataset as close to your inference environment as possible. Remember that your ‘environment’ includes all aspects of the input images: lighting, angle, objects, etc.
Consider the following images for building a model that detects license plates. While every picture includes a car with a license plate, they have very different environments. The first one depicts a nighttime scene with multiple cars; the second image shows a darker scene with rain; the third one has more than one car, includes people, and is more pixelated; and the fourth one has a different style of license plate than the others. These are all examples of variables that need to be considered when choosing images.
NOTE: Many nearly identical images won’t improve your model. If your video has many frames per second, sample only selected frames that best suit your dataset needs rather than using every frame.
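One simple way to sample frames is to keep one frame every fixed interval. The helper below is a sketch of that idea; the function name and the default of one sample per second are our own assumptions, and a good rate depends on how quickly the scene changes.

```python
def sample_frame_indices(total_frames, fps, samples_per_second=1.0):
    """Return evenly spaced frame indices, roughly samples_per_second per second."""
    # Never use a step below 1, so we cannot sample more often than the video allows.
    step = max(1, round(fps / samples_per_second))
    return list(range(0, total_frames, step))

# A 3-second clip recorded at 30 frames per second, sampled once per second.
indices = sample_frame_indices(total_frames=90, fps=30, samples_per_second=1.0)
print(indices)
```

The returned indices can then be used to pull just those frames from the video for annotation.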
Using a Digitally Generated Dataset
Another way to generate a dataset is to use synthetic (i.e., computer-generated) data. This technique can be used if you are unable to gather enough data yourself. By using synthetic data, especially when training for unusual circumstances, you can generate a much larger dataset than you could collect from real-world occurrences, which can improve model performance.
A technique that can be used to boost your current dataset is data augmentation. This involves taking the existing dataset images and flipping, rotating, cropping, padding, and otherwise modifying them to create images that are different enough to constitute a new data point. This can add variety to the training data and help avoid overfitting. Many deep learning frameworks (discussed below) include data augmentation capabilities.
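To show what a couple of these transformations do at the pixel level, here is a small sketch using nested lists as images. In practice you would likely rely on your framework's augmentation tools; the function names here are ours. Note that for object detection the bounding-box annotations have to be transformed along with the image, which the last helper illustrates for a horizontal flip, assuming boxes are stored as `(x1, y1, x2, y2)`.

```python
def flip_horizontal(image):
    """Mirror a row-major image left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate a row-major image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]

def flip_box_horizontal(box, image_width):
    """Mirror a bounding box (x1, y1, x2, y2) to match a horizontally flipped image."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)

image = [[1, 2, 3],
         [4, 5, 6]]
box = (0, 0, 1, 2)  # covers the left-most column of the 3-pixel-wide image

flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
flipped_box = flip_box_horizontal(box, image_width=3)
print(flipped, rotated, flipped_box)
```

After the flip, the box covers the right-most column, so it still surrounds the same pixels it did in the original image.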
Generally, annotation is the process of selecting a portion of an image and assigning a label to that region. In the case of image classification, annotation is done by assigning a single label to the entire image, rather than a specific region. Annotated data is the input for model training via supervised learning. With many annotations and enough variation in the underlying images, model training identifies distinguishing characteristics of the images and learns to detect the objects of interest, classify images, etc., depending on the type of model being trained.
Examples of annotation include: drawing bounding boxes or 3D cuboid boxes and assigning labels to objects for object detection; tracing around objects using polygonal outlines for semantic and instance segmentation; identifying key points and landmarks and assigning labels for object landmark detection; and identifying straight lines for lane detection. For image classification, images are placed in groups, with each group corresponding to one label.
After the data is collected and annotated, it is used as input for model training. The data is fed into the DNN, which then outputs the prediction: a label for image classification, labels and bounding boxes for object detection, label maps for image segmentation, and key point sets for landmark detection - all of which are accompanied by a confidence value. The training process compares each prediction to the annotations and adjusts the DNN to produce better predictions. This process is repeated until the model performs well according to metrics specific to each training session or model type. To build your own computer vision model for your application, you can either start from scratch and build your own model architecture or you can use an existing one.
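The predict, compare, adjust, repeat cycle can be illustrated with a deliberately tiny model. The sketch below is not a DNN: it is a single-feature logistic regression trained with stochastic gradient descent, chosen only because it fits in a few lines. The data, learning rate, and epoch count are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=500, learning_rate=0.5):
    """Fit a one-feature logistic regression with stochastic gradient descent."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):                           # repeat the whole process
        for x, y in zip(samples, labels):
            prediction = sigmoid(weight * x + bias)   # model outputs a prediction
            error = prediction - y                    # compare with the annotation
            weight -= learning_rate * error * x       # adjust the parameters
            bias -= learning_rate * error
    return weight, bias

# Toy "dataset": one feature per sample, annotated with label 0 or 1.
samples = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]

weight, bias = train(samples, labels)
predictions = [int(sigmoid(weight * x + bias) >= 0.5) for x in samples]
print(predictions)
```

A real DNN has millions of parameters and uses backpropagation to compute the adjustments, but the loop has the same shape: the value returned by `sigmoid` plays the role of the confidence, and the annotations provide the target the model is nudged toward.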
Transfer Learning & Retraining
Transfer learning leverages the knowledge gained from training a model on a general dataset and applies it to other, possibly more specific, scenarios. You can take an existing model that has been trained on a large, general dataset, or a dataset that is similar to yours, and re-train it on new labels that are specific to your use case. For instance, if you’re building a model that detects garbage, you could take a general object detection model and train it on specific labels relevant to your use cases such as cans, bottles, and toilet paper rolls. Later, you realize you want to be able to detect take-out containers and cutlery. You could then collect and annotate additional images that contain these new objects, adding them to your original dataset and training the model with these additional labels. This effectively updates the model so it can be applied to additional use cases.
NOTE: Model re-training has the extra advantage of requiring less training data!
Testing the Application
Finally, once you have made your model using your training data, you can test it by feeding it new, unannotated images and seeing whether the model classifies, detects, etc. how you expect it to. For instance, continuing with the example of building a model to detect birds on an outside camera feed, you could collect some new video footage from the camera, perhaps a compilation of footage over the course of a typical day, or during different weather, and examine the output from the model in real-time. While this isn’t an automated or quantified test, it enables you to quickly identify any shortcomings or edge cases that the model does not perform well on (for example during dawn and dusk, or when it is raining). You can then re-train your model using annotations from images depicting these edge cases, or at least be aware of the specific environments in which the model performs best.
Using Your Model With alwaysAI
alwaysAI enables users to deploy computer vision models to the edge quickly and easily. We’ve described a few models in the alwaysAI model catalog that can be used to build applications, and you can see descriptions of all the models in the catalog here.
Our API, edge IQ, enables you to tailor these models in your applications independent of hardware choices and develop prototypes rapidly. You can filter by certain labels, mark up images with predictions, change the markup color, add text, swap out media input and output, change the engine or accelerator, and much more using the same standard library.
With alwaysAI, you can also upload a custom model into the model catalog and add it to your testing app. Or, you can simply swap out the training output files in your local starter app to quickly test out your new model.
Sign up for an alwaysAI account and get started in computer vision today!
Contributions to the article by Andres Ulloa, Todd Gleed, Eric VanBuhler, Jason Koo and Vikram Gupta