Keypoint annotation when boxes are not sufficient!

Agnes from BoBox

Computer vision algorithms are trained on visual data enriched with an additional layer of information, typically provided by human annotators. This layer describes the content of the image, such as the shapes or objects in the scene, together with their position, type or attributes. There are several ways to provide this information so that an AI model can be trained and evaluated for its intended use.

Why keypoint annotation?

The most popular form of annotation is the rectangular envelope drawn around an object in the image, the so-called bounding box. Bounding boxes are simple and quick to draw but, depending on the shape of the annotated object, they may also cover a significant portion of the background. Object detection algorithms trained on such annotations return their predictions in the same form. This output turns out to be sufficient in many real-life use cases, for example object counting.

However, some computer vision tasks require more detailed information, such as the exact shapes of the detected objects, and the algorithms must then be trained on correspondingly detailed annotations. Polygon annotations provide precise information on the object’s envelope, which in this case can be an arbitrary polygon. Such annotation is more costly, but it provides enough flexibility to define even complex shapes.
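To make the difference between the two annotation types concrete, here is a minimal sketch that estimates how much of a bounding box is background when the true object outline is known as a polygon. The function names and the triangle used as sample data are illustrative assumptions, not part of any particular annotation tool.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) of the tightest axis-aligned box."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple (non self-intersecting) polygon, shoelace formula."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def background_fraction(polygon: List[Point]) -> float:
    """Share of the bounding box that is NOT covered by the polygon."""
    x_min, y_min, x_max, y_max = bounding_box(polygon)
    box_area = (x_max - x_min) * (y_max - y_min)
    if box_area == 0:
        return 0.0  # degenerate outline: nothing to compare
    return 1.0 - polygon_area(polygon) / box_area


# Hypothetical object outline: a right triangle with legs of 4 pixels.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
print(background_fraction(triangle))
```

For this triangle, half of the bounding box is background, which illustrates why a box alone can be a poor description of an elongated or irregular object.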

Unlike rectangular bounding boxes or polygon annotations, keypoints allow algorithms to recognize not only the objects in the scene but also their pose, their key features and the relative positions of those features. This kind of information is crucial in some important computer vision applications, including pose detection, emotion detection and face recognition. Keypoints are points that pinpoint the key features of an object. For example, a human face is characterized by the eyes, nose, mouth, eyebrows etc. Their relative positions, distances and individual shapes make it possible, among other things, to re-identify faces and detect emotions. The positions of points such as the joints of the human silhouette make it possible to determine a person’s pose and predict their movements. In many cases, keypoints can be naturally grouped into larger collections by introducing connections between them (like the bones connecting the joints in a human body). Such a collection of keypoints is called a skeleton. Both keypoints and skeletons are used to train AI models that can detect an object’s pose or movement, recognize activity, or assess whether an object’s features comply with a standard.
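One possible way to represent keypoints and a skeleton in code is sketched below. The class names, the `visible` flag and the three-point arm are assumptions made for illustration only; real annotation formats differ in naming and detail.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Keypoint:
    x: float
    y: float
    visible: bool = True  # assumed flag for occluded or hidden points


@dataclass
class Skeleton:
    keypoints: Dict[str, Keypoint]   # named key features of the object
    edges: List[Tuple[str, str]]     # connections between keypoints

    def validate(self) -> None:
        """Check that every edge connects two keypoints that exist."""
        for start, end in self.edges:
            for name in (start, end):
                if name not in self.keypoints:
                    raise ValueError(f"edge refers to unknown keypoint: {name}")

    def edge_lengths(self) -> Dict[Tuple[str, str], float]:
        """Euclidean length of every connection, in pixels."""
        lengths = {}
        for start, end in self.edges:
            a, b = self.keypoints[start], self.keypoints[end]
            lengths[(start, end)] = math.hypot(b.x - a.x, b.y - a.y)
        return lengths


# Hypothetical annotation of one arm; the wrist is marked as occluded.
arm = Skeleton(
    keypoints={
        "shoulder": Keypoint(0.0, 0.0),
        "elbow": Keypoint(3.0, 4.0),
        "wrist": Keypoint(3.0, 10.0, visible=False),
    },
    edges=[("shoulder", "elbow"), ("elbow", "wrist")],
)
arm.validate()
print(arm.edge_lengths())
```

Keeping the connections as pairs of keypoint names, rather than as indices, is a design choice that keeps the annotation readable and makes a missing keypoint easy to detect.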

Areas of application

Keypoint annotation plays a vital role in many computer vision applications. See the following examples where keypoint annotation is a crucial component:

Face recognition – in order to properly recognize a face, the algorithm needs to detect its key features and assign their unique combination to an individual. For this purpose, keypoints such as the tip of the nose, the boundaries of the lips, the eyes and eyebrows, the ears, the cheeks etc. are used to define the portrait and compare it with the gallery for identification.
Emotion detection – similar keypoint data are used for emotion detection. Facial expressions are classified into a given class of emotions based on the relative positions of the keypoints as well as the angles in the skeletons. One particular area of deployment is driver monitoring systems, which detect signs of driver distraction or drowsiness. The technology helps prevent accidents by alerting drivers when they exhibit behaviors that may compromise safety.
Human Pose Estimation – by marking specific keypoints on a human body, such as joints, one can create a model of poses. Understanding the human pose provides valuable insight in various areas of application. For example, it is used in sports analytics to analyze player movements in order to enhance athletic performance or detect irregularities that may indicate potential injuries. In healthcare, human pose estimation is used for the prevention of and rehabilitation after musculoskeletal injuries, and for motor assessment of persons with neurological damage or disease.
Gesture Recognition – human gestures are a crucial element of human-machine interaction in virtual and augmented reality environments. Keypoint annotation makes it possible to model the gestures that trigger the machine’s response to human actions.
Medical Imaging – another application of keypoint annotation in healthcare is the automated analysis of medical images such as X-ray, ultrasound or MRI scans. The algorithms can be trained to recognize specific points of interest as well as the distances and angles between them. For instance, annotated keypoints aid in the precise diagnosis, treatment planning and monitoring of dental conditions.
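Several of the applications above rely on angles measured between keypoints, for example the angle at a joint in pose estimation. A minimal sketch of that calculation follows; the function name `joint_angle` and the sample coordinates are illustrative assumptions.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees at keypoint b, between the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("keypoints a and c must differ from b")
    cosine = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))


# Hypothetical shoulder, elbow and wrist positions in image coordinates.
shoulder, elbow, wrist = (1.0, 0.0), (0.0, 0.0), (0.0, 1.0)
print(joint_angle(shoulder, elbow, wrist))  # elbow bent at a right angle
```

The same function applies to any three connected keypoints, whether they are joints of a body, landmarks on a face or points of interest in a medical image.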

Keypoint dataset preparation

Preparing a good-quality training dataset involves manual annotation of all defined keypoints (or skeletons) in the image. Human annotators must recognize not only the object in the image but also all of its keypoints and their connections. It is often difficult to indicate all keypoints because of the variety of orientations, occlusions and partial views of the objects. This is why clear annotation rules, edge-case decisions and exceptions should be well documented in the annotation guidelines. The complexity of annotation poses a significant risk of inconsistency and errors, so regular quality control should be included as part of the annotation process. Last but not least, as detailed annotation is also extremely costly, you may want to use a hybrid approach that supports human annotators with automated pre-annotation techniques. Our experts can help you at every stage of the annotation process to deliver the best-quality datasets.
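As one illustration of what a quality control step might look like, the sketch below compares two annotators’ keypoints for the same object and flags those that disagree. The function name, the 5-pixel tolerance and the sample coordinates are assumptions chosen for the example; a real project would take its threshold from the annotation guidelines.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def keypoint_disagreements(
    first: Dict[str, Point],
    second: Dict[str, Point],
    tolerance_px: float,
) -> Dict[str, float]:
    """Return keypoints whose two annotations differ by more than tolerance_px.

    A keypoint marked by only one annotator is reported with distance inf.
    """
    flagged = {}
    for name in sorted(set(first) | set(second)):
        if name not in first or name not in second:
            flagged[name] = math.inf
            continue
        distance = math.dist(first[name], second[name])
        if distance > tolerance_px:
            flagged[name] = distance
    return flagged


# Hypothetical face annotations from two annotators, in pixels.
annotator_a = {"nose": (100.0, 100.0), "left_eye": (90.0, 90.0), "right_eye": (110.0, 90.0)}
annotator_b = {"nose": (101.0, 100.0), "left_eye": (90.0, 96.0)}

flagged = keypoint_disagreements(annotator_a, annotator_b, tolerance_px=5.0)
print(flagged)
```

Flagged keypoints can then be sent back for review, and frequent disagreement on the same keypoint may indicate that the guidelines need a clearer rule for it.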