Introduction to Computer Vision Model Training
Training a computer vision model is one component of a complex and iterative undertaking, which can often seem daunting. At alwaysAI we want to make the process simple and approachable. To get you started, we have compiled a general overview of the training process of Deep Neural Networks (DNNs) for use in computer vision applications. We will focus on supervised learning in this overview, which uses labeled training data to teach the model what the desired output is. This article provides an introduction to each component of the model training process and will be the contextual basis for more in-depth articles in the future.
Types of Models
Different computer vision models help us answer questions about an image. What objects are in the image? Where are those objects in the image? Where are the key points on an object? What pixels belong to each object? We can answer these questions by building different types of DNNs. These DNNs can then be used in applications to solve problems like determining how many cars are in an image, whether a person is sitting or standing, or whether an animal in a picture is a cat or a dog. We’ve outlined a few of the most common types of computer vision models and their use cases below.
NOTE: In general, computer vision model output consists of a label and a confidence or score, which is some estimate of the likelihood of correctly labeling the object. This definition is intentionally vague, as ‘confidence’ will mean very different things for different types of models.
In describing the different types of models and their use cases, we’ll use the example of a virtual wardrobe: an application that lets users try on different clothing items virtually, for instance before making a purchase.
Image classification attempts to identify the most significant object class in the image; in computer vision, we refer to each class as a label. For example, we can use a general classification model, such as ‘alwaysai/googlenet’, to identify items of clothing, such as ‘running shoes’ or a ‘sweatshirt’, as shown below. The model takes an image as input and outputs a label, along with the confidence the model has in that label compared to other labels. However, DNNs for image classification do not provide the location of the object in the image. For use cases that need this information, such as tracking or counting objects, we need an object detection model, which is described next.
Caption: examples of a general classification model used to classify clothing items.
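To make the idea of "a label plus a confidence relative to other labels" concrete, here is a minimal sketch in plain Python. It assumes the classifier produces one raw score per label and that the confidence is obtained with a softmax; the labels and scores are made up for illustration, and real models may compute confidence differently.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, labels):
    """Return the top label and its confidence relative to the other labels."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Hypothetical labels and raw scores, for illustration only.
labels = ["running shoe", "sweatshirt", "scarf"]
scores = [2.0, 0.5, 0.1]

label, confidence = classify(scores, labels)
print(label, round(confidence, 2))
```

Note that the output says nothing about where the shoe is in the image, only which label scored highest.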
When the location of the object is of importance, object detection DNNs are typically used. These networks return a set of coordinates, called a bounding box, that specify an area of the input image containing an object, along with a confidence value for that bounding box and a label. For our ‘virtual wardrobe’ application, we would need an input image of the person who wants to try on the virtual clothes, and then we would need to find the person in the image. To do this, we could use an object detection model for person detection, such as ‘alwaysai/mobilenet_ssd’, that would return a bounding box around every person in an image, along with a label, ‘person’, and a confidence value for the output. An example of an object detection model that can distinguish people is shown below.
NOTE: Knowing the location of objects in a frame allows us to infer certain information about an image. For instance, we could count how many vehicles there are on a freeway to map out traffic patterns. We can also extend the functionality of an application by piggy-backing a classification model onto an object detection model. For instance, we could feed the portion of the image corresponding to the bounding box from the detection model into the classification model so we could count how many trucks there are in the image versus sedans.
Caption: an example of two detection models’ outputs. In this case, we can see that both models can detect the person, but one model is able to detect both types of water bottles, whereas the other model has only found one.
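The counting and cropping ideas in the note above can be sketched in a few lines. This is an illustration only: the `Detection` class, the `(start_x, start_y, end_x, end_y)` box layout, and the 0.5 confidence threshold are assumptions made for the example, not the output format of any particular model or library.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    confidence: float
    box: tuple  # (start_x, start_y, end_x, end_y) in pixels; assumed layout


def filter_detections(detections, label, min_confidence=0.5):
    """Keep detections of one label that meet the confidence threshold."""
    return [d for d in detections
            if d.label == label and d.confidence >= min_confidence]


def crop(image, box):
    """Return the part of the image (a list of pixel rows) inside the box."""
    start_x, start_y, end_x, end_y = box
    return [row[start_x:end_x] for row in image[start_y:end_y]]


# A tiny 4x6 "image" whose pixel values are just (row, column) pairs.
image = [[(r, c) for c in range(6)] for r in range(4)]

detections = [
    Detection("person", 0.92, (1, 0, 4, 3)),
    Detection("person", 0.31, (0, 0, 2, 2)),  # dropped: low confidence
    Detection("bottle", 0.88, (4, 1, 6, 4)),
]

people = filter_detections(detections, "person")
person_crop = crop(image, people[0].box)
print(len(people), "person found; crop is",
      len(person_crop), "x", len(person_crop[0]), "pixels")
```

The cropped region is what we would pass to a second, classification model when chaining the two.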
Now we’ve seen how we can classify clothing into groups, such as shoes or sweatshirts, and we can detect a person in an image, but we still need to be able to let the user try on the clothes. This requires being able to distinguish the pixels that belong to the detected object from the pixels in the rest of the image, and in that case we would want to use segmentation, which we’ll cover next.
As we described above, in some tasks it is important to understand the exact shape of the object. This requires generating a pixel-level boundary for each object, which is achieved through image segmentation. DNNs for image segmentation classify each pixel in an image either by object type, in the case of semantic segmentation, or by individual object, in the case of instance segmentation.
NOTE: Currently, the alwaysAI platform supports semantic segmentation. We are always looking to grow the platform and are adding new models, including models that perform instance segmentation.
For our virtual wardrobe application, we could use semantic segmentation to distinguish the pixels that belong to the person, sweater, or shoe. We can then replace the pixels in one object with those from another object, such as replacing a sweater that the person in the original image was wearing with the pixels from the new sweater that the user wants to try on.
NOTE: A popular use case for semantic segmentation is the virtual background used for tele-conferencing software like Zoom or Microsoft Teams: the pixels belonging to a person are distinguished from the rest of the image and are segmented out from the background.
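The pixel replacement described above, whether for swapping a sweater or a background, comes down to using the per-pixel label map as a mask. The sketch below assumes a semantic segmentation output in the form of a grid of class ids with the same dimensions as the image; the class ids and the string "pixels" are hypothetical placeholders.

```python
def replace_class_pixels(image, label_map, target_class, replacement):
    """Copy replacement pixels into the image wherever the label map
    marks a pixel as belonging to target_class."""
    return [
        [
            replacement[r][c] if label_map[r][c] == target_class else image[r][c]
            for c in range(len(image[r]))
        ]
        for r in range(len(image))
    ]


# Hypothetical class ids: 0 = background, 1 = person, 2 = sweater.
SWEATER = 2

image = [["bg", "skin", "old"],
         ["bg", "old", "old"]]
label_map = [[0, 1, 2],
             [0, 2, 2]]
new_sweater = [["new"] * 3 for _ in range(2)]

result = replace_class_pixels(image, label_map, SWEATER, new_sweater)
print(result)
```

For a virtual background, the same function would be called with the background class as the target.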
In our examples for image segmentation applications for people, we saw that image segmentation enables us to distinguish which pixels belong to each object in an image. However, image segmentation doesn’t enable us to infer anything about the relative position of objects in the image, such as where a person’s hand is or where the taillights and bumper are on a car.
For that, we would need information about specific areas on a person, say where a person's hand is compared to their head. This requires tracking object landmarks, which is the last model type we’ll cover here.
Object Landmark Detection
Object landmark detection is the labeling of certain ‘keypoints’ in images that capture important features in the object. For our virtual wardrobe, we could use a pose estimation model, such as ‘alwaysai/human-pose’, that identifies body keypoints such as hips, shoulders and elbows, similar to the image shown below, to help our users accessorize. We could use eye keypoints to place glasses or a hat on the person in our virtual wardrobe, or use the ‘neck’ keypoint to let them try on a scarf.
NOTE: Another useful application that uses keypoints would be one that checks for proper form during exercises and sports.
Caption: body keypoints detected using 'alwaysai/human-pose'.
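As a rough illustration of how keypoints could drive the glasses placement mentioned above, the following sketch derives a position, width, and tilt from two eye keypoints. The keypoint names and the dictionary layout are assumptions for the example; the actual output format depends on the pose model you use.

```python
import math


def glasses_placement(keypoints):
    """Return the center point, width, and tilt (in degrees) for an overlay
    spanning the two eye keypoints."""
    left_x, left_y = keypoints["left_eye"]
    right_x, right_y = keypoints["right_eye"]
    center = ((left_x + right_x) / 2, (left_y + right_y) / 2)
    width = math.hypot(right_x - left_x, right_y - left_y)
    tilt = math.degrees(math.atan2(right_y - left_y, right_x - left_x))
    return center, width, tilt


# Hypothetical keypoints as (x, y) pixel coordinates.
keypoints = {"left_eye": (100, 120), "right_eye": (160, 120), "neck": (130, 200)}

center, width, tilt = glasses_placement(keypoints)
print(center, width, tilt)
```

The same approach would work for a scarf anchored at the neck keypoint, using the shoulders to estimate scale.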
Model Applications in General
Computer vision models can be applied to a whole host of various applications. You could build a classification model for classifying types of dogs in a dog show, or build a detection model to find cancerous cells in biopsy slides. Conservation biologists could use one model to detect the presence of a particular family or genus, and then feed that output to a classification model to determine species, and then aggregate this data to track conservation efforts.
Semantic segmentation is used for self-driving car technologies. An object detection model could be used to count items and generate inventory in a grocery store. The possibilities for computer vision applications are extensive!
Caption: example output from the starter app 'semantic_segmentation_cityscape'.
Computer vision model training begins with assembling a quality dataset. As the adage goes, “garbage in, garbage out”. But what constitutes “garbage” for a computer vision dataset? In computer vision, “inference” is the term we use for applying a trained model to an input to infer an outcome. We like to say “train as you will inference”. So, a good rule of thumb for a quality computer vision dataset is that it is similar to the real-world data that will be input into the trained model. To ensure this, consider the angle of the image, the lighting and weather, whether the desired objects are obscured, how far away the desired targets are, the resolution and scale of the image, as well as the background and foreground.
Types of Dataset Generation
There are a few different ways to generate a dataset, depending on your timeline and desired use cases. If you want a general detection model to fit into a prototype application right away, you may want to try to find an existing dataset complete with annotations. If you instead want a model that does one specific task very well, you’ll probably need to collect your own images that more closely resemble the environment the model will be used in.
For instance, if you just want a model that will detect birds in general, you could probably go with a pre-existing dataset of images containing birds in different settings; however, if you are that conservation biologist we described earlier and you want a model that will be used to consistently detect birds appearing on the feed from a particular camera you’ve set up, you should collect images from the perspective of that camera to train the model on.
We’ll describe both of these options, as well as a couple others, below. We’ll also cover data augmentation, which you can use to increase the size of any dataset.
Using an Existing Annotated Dataset
Depending on what you intend your model to detect, there may be freely available annotated datasets for you to use, which can drastically cut down the time it takes to train and deploy your desired model.
However, you will have much less control over the data quality this way, since the training data might not as closely resemble the data you'll be using as input into the model, and, as such, the performance of your model may suffer. Therefore this approach may be best for proof of concept projects, and you will likely need to generate your own specific dataset at a later date, depending on your specific application.
Some popular existing datasets include Common Objects in Context (COCO), PASCAL Visual Object Classes (VOC), ImageNet, and Google’s Open Images Dataset V6. Some public datasets have non-commercial licenses, so remember to always check the license of any existing dataset you use.
Using Existing Data or Collecting Your Own Data
You can compile your own dataset by recording video, taking photos, or searching for freely available videos and images online. Unlike pulling data from an existing annotated dataset, you will need to annotate your collected images before you can use them for training. There are many popular sites for gathering photos, including Unsplash, Pixabay, and Pexels; the latter two also provide videos. While gathering data, keep in mind the principles for data collection outlined earlier and attempt to keep your dataset as close to your inference environment as possible, remembering that ‘environment’ includes all aspects of the input images: lighting, angle, objects, etc.
Consider the following images for building a model that detects license plates. While every picture includes a car with a license plate, they have very different content. The first one depicts a nighttime scene with multiple cars; the second image shows a darker scene, but still daylight, with rain; the third one has more than one car, includes people and is more pixelated; and the fourth one has a different style of plate than the others. These are all examples of variables that need to be considered when choosing images.
NOTE: Many nearly identical images won’t improve your model, so if your video has many frames per second, sample it to select the frames that best suit your dataset needs.
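One simple way to sample a video is to keep a fixed number of frames per second of footage. The sketch below only computes which frame indices to keep; reading the frames themselves is left to whichever video library you use. The function name and the one-frame-per-second default are illustrative choices.

```python
def sample_frame_indices(total_frames, fps, keep_per_second=1.0):
    """Return evenly spaced frame indices, keeping roughly
    keep_per_second frames for each second of video."""
    step = max(1, round(fps / keep_per_second))
    return list(range(0, total_frames, step))


# A 10-second clip recorded at 30 frames per second.
indices = sample_frame_indices(total_frames=300, fps=30, keep_per_second=1.0)
print(len(indices), indices[:3])
```

Even spacing is only a starting point; you may still want to review the sampled frames and discard ones that look too similar.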
Using a Digitally Generated Dataset
Another way to generate a dataset is to use synthetic (i.e., computer-generated) data. This technique is useful if you are unable to gather enough data yourself. By using synthetic data, especially for unusual circumstances that are hard to capture, you can generate a much larger dataset than you would otherwise be able to gather from real-world occurrences, which can improve model performance.
Caption: a synthetic image.
Finally, a technique that can be used to boost your current dataset is data augmentation. This involves taking the existing dataset images and flipping, rotating, cropping, padding, or otherwise modifying them, to create images that are different enough to constitute a new data point; this can add variety to the training data and help avoid overfitting. Many deep learning frameworks (discussed below) include data augmentation capabilities.
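The flipping, rotating, padding, and cropping mentioned above can be illustrated on a tiny image stored as a list of pixel rows. This is a sketch in plain Python to show what each transformation does; in practice you would normally rely on the augmentation utilities of your deep learning framework.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def pad(image, border, fill=0):
    """Surround the image with `border` pixels of `fill` on every side."""
    padded_width = len(image[0]) + 2 * border
    top = [[fill] * padded_width for _ in range(border)]
    bottom = [[fill] * padded_width for _ in range(border)]
    middle = [[fill] * border + list(row) + [fill] * border for row in image]
    return top + middle + bottom


def crop(image, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


image = [[1, 2, 3],
         [4, 5, 6]]

flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
padded = pad(image, 1)
cropped = crop(image, 0, 1, 2, 2)
print(flipped, rotated, padded, cropped, sep="\n")
```

For annotated data, remember that geometric transformations must also be applied to the annotations: a flipped image needs flipped bounding boxes.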
Generally, annotation is the process of selecting a portion of an image and assigning a label to that region. In the case of image classification, annotation is done by assigning a single label to the entire image, rather than a specific region. Annotated data are the input for model training via supervised learning. With many annotations and with enough variation in the underlying images, the model training identifies distinguishing characteristics of the images, and learns to detect the objects of interest, classify images, etc., depending on the type of model being trained.
Caption: examples of different annotation types. Instance segmentation of two cars on a highway in the upper left; 3D cuboid boxes in the upper right. Straightline annotations in a highway example are shown on the bottom left, and keypoint annotations in the bottom right.
Examples of annotation include: drawing bounding boxes or 3D cuboid boxes and assigning labels to objects for object detection, tracing around objects using polygonal outlines for semantic and instance segmentation, identifying key points and landmarks and assigning labels for object landmark detection, and identifying straight lines, such as is used in lane detection. For image classification, images are placed in groups, with each group corresponding to one label.
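To give a sense of what bounding box annotations look like as data, here is a minimal sketch of an annotation record. The field names and layout are hypothetical and do not follow any specific dataset format; established formats such as those used by COCO or PASCAL VOC store similar information in their own structures.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BoxAnnotation:
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


@dataclass
class AnnotatedImage:
    file_name: str
    width: int
    height: int
    boxes: list = field(default_factory=list)

    def add_box(self, label, x_min, y_min, x_max, y_max):
        """Attach a labeled box, rejecting boxes that fall outside the image."""
        inside = (0 <= x_min < x_max <= self.width
                  and 0 <= y_min < y_max <= self.height)
        if not inside:
            raise ValueError("bounding box must lie inside the image")
        self.boxes.append(BoxAnnotation(label, x_min, y_min, x_max, y_max))


def label_counts(images):
    """Count annotations per label, e.g. to check how balanced a dataset is."""
    return Counter(box.label for image in images for box in image.boxes)


# Hypothetical file name and coordinates, for illustration only.
frame = AnnotatedImage("feeder_0001.jpg", width=640, height=480)
frame.add_box("bird", 100, 80, 220, 200)
frame.add_box("bird", 300, 90, 380, 170)
frame.add_box("squirrel", 400, 300, 520, 420)

counts = label_counts([frame])
print(counts)
```

Counting annotations per label is a quick way to spot labels that are underrepresented before training begins.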
After the data are collected and annotated, they are used as input for model training. The data are fed into the DNN, which then outputs the prediction: a label for image classification, labels and bounding boxes for object detection, label maps for image segmentation, and keypoint sets for landmark detection, all of which are accompanied by a confidence. The training process compares the prediction to the annotations and adjusts the DNN to produce better predictions. The process is repeated until you have good performance, based on a variety of metrics specific to each training session or model type. To build your own computer vision model for your application, you can start from scratch and build your own model architecture, or you can use an existing one.
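The predict, compare, and adjust cycle described above can be shown with a deliberately tiny stand-in for a DNN: a model with a single weight and bias, trained by gradient descent on four labeled numbers. Everything here (the model, the data, the learning rate, the number of epochs) is an assumption chosen to keep the example short; real networks have many more parameters, but training follows the same loop.

```python
import math


def predict(weight, bias, x):
    """Return a confidence between 0 and 1 that x belongs to class 1."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def train(samples, epochs=500, learning_rate=0.5):
    """Repeatedly predict, compare with the annotation, and adjust."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, target in samples:
            error = predict(weight, bias, x) - target  # prediction vs. annotation
            grad_w += error * x
            grad_b += error
        # Move the parameters in the direction that reduces the error.
        weight -= learning_rate * grad_w / len(samples)
        bias -= learning_rate * grad_b / len(samples)
    return weight, bias


# Hypothetical annotated data: (input value, label).
samples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

weight, bias = train(samples)
print(round(predict(weight, bias, 2.0), 3), round(predict(weight, bias, -2.0), 3))
```

After training, the confidence for inputs from each class moves toward the annotated label, which is the "better predictions" the loop is working toward.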
Transfer Learning & Retraining
Transfer learning leverages the knowledge gained from training a model on a general dataset, and applies it to other, possibly more specific, scenarios. You can take an existing model that has been trained on a large, general dataset, or a dataset that is similar to yours, and re-train on new labels that are specific to your use case. For instance, if you’re building a model that detects garbage, you could take a general object detection model and train it on specific labels relevant to your use case such as cans, bottles, and toilet paper rolls. Suppose you later realize you want to be able to detect take-out containers and cutlery. You could then collect and annotate additional images that contain these new objects, add them to your original dataset, and train the model with these additional labels. This effectively updates the model so it can be applied to additional use cases.
NOTE: Model re-training has the extra advantage of requiring less training data!
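One common way to apply transfer learning, though not the only one, is to keep the pretrained layers fixed and train only a small new output layer on your own labels. The sketch below imitates that idea at toy scale: `frozen_features` is a hypothetical stand-in for pretrained layers, and only the new "head" is trained. The data, labels, and hyperparameters are made up for illustration.

```python
import math


def frozen_features(pixel_values):
    """Stand-in for pretrained layers: fixed, never updated during re-training."""
    mean = sum(pixel_values) / len(pixel_values)
    spread = max(pixel_values) - min(pixel_values)
    return [mean, spread]


def head_predict(head, features):
    """New output layer: confidence that the input belongs to class 1."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


def retrain_head(samples, epochs=2000, learning_rate=0.5):
    """Train only the new head; the feature extractor stays untouched."""
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    cached = [(frozen_features(x), y) for x, y in samples]  # computed once
    for _ in range(epochs):
        for features, target in cached:
            error = head_predict(head, features) - target
            head["weights"] = [w - learning_rate * error * f
                               for w, f in zip(head["weights"], features)]
            head["bias"] -= learning_rate * error
    return head


# Hypothetical tiny "images": label 0 = can, label 1 = bottle.
samples = [
    ([0.1, 0.2, 0.1, 0.2], 0),
    ([0.2, 0.2, 0.3, 0.1], 0),
    ([0.8, 0.9, 0.7, 0.9], 1),
    ([0.9, 0.6, 0.8, 0.7], 1),
]

head = retrain_head(samples)
predictions = [head_predict(head, frozen_features(x)) > 0.5 for x, _ in samples]
print(predictions)
```

Because only the small head is being fit, far fewer examples are needed than when training every layer from scratch, which is the advantage the note above refers to.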
Finally, once you have made your model using your training data, you can test it by feeding in new, unannotated images and seeing whether the model classifies, detects, etc. as you expect. For instance, continuing with the example of building a model to detect birds on an outside camera feed, you could collect some new video footage from the camera, perhaps a compilation of footage over the course of a typical day, or during different weather, and examine the output from the model in real time. While this isn’t an automated or quantified test, it enables you to quickly identify any shortcomings or edge cases that the model does not perform well on, such as during dawn and dusk, or when it is raining. You can then re-train your model using annotations from images depicting these edge cases, or at least be aware of the specific environments in which the model performs best.
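If you want to keep track of what you observe during this kind of review, a simple tally per condition can help highlight where the model struggles. The sketch below assumes you note, for each clip you watch, the condition and whether the output looked correct; the condition names, the notes themselves, and the 0.8 threshold are illustrative.

```python
from collections import defaultdict


def accuracy_by_condition(reviews):
    """Return the fraction of correct outputs for each condition."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, was_correct in reviews:
        totals[condition] += 1
        if was_correct:
            correct[condition] += 1
    return {condition: correct[condition] / totals[condition]
            for condition in totals}


def weak_conditions(reviews, threshold=0.8):
    """List the conditions whose accuracy falls below the threshold."""
    scores = accuracy_by_condition(reviews)
    return sorted(c for c, score in scores.items() if score < threshold)


# Hypothetical review notes: (condition, did the output look correct?).
reviews = [
    ("midday", True), ("midday", True), ("midday", True), ("midday", True),
    ("dusk", True), ("dusk", False),
    ("rain", False), ("rain", True), ("rain", False),
]

weak = weak_conditions(reviews)
print(accuracy_by_condition(reviews), weak)
```

Conditions that show up in the weak list are good candidates for collecting and annotating additional images.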
Using Your Model on alwaysAI
alwaysAI enables users to get up and running and deploy computer vision models to the edge quickly and easily. We’ve described a few models from the alwaysAI model catalog that can be used to build the applications we highlighted as examples in this article; descriptions of all the models are available in the model catalog.
Our API, edgeIQ, enables you to tailor how you use these models in your applications independent of hardware choices and develop prototypes rapidly. You can filter by certain labels, mark up images with predictions, change the markup color, add text, swap out media input and output, change the engine or accelerator, and much more, all using the same standard library.
With alwaysAI, you can also upload a custom model into the model catalog and add it to your testing app, or you can simply swap out the training output files in your local starter app, to quickly test out your new model.
Head to alwaysai.co, sign up to use our platform, and get started in computer vision today!
Contributions to the article by Andres Ulloa, Todd Gleed, Eric VanBuhler, Jason Koo and Vikram Gupta