Data quantity vs Data quality

Data quantity refers to the amount of data available for analysis. In general, the more data available, the better the chances of identifying patterns and trends within it. Data quality, on the other hand, refers to the accuracy, completeness, and relevance of the data being analyzed. High-quality data is essential for building accurate data-driven solutions; poor data quality can lead to incorrect results and misinterpretation of data.

While having a large quantity of data can be beneficial, it is not always necessary. In some cases, a smaller dataset of high quality can provide more meaningful insights than a larger dataset of lower quality. The overall effect of training an object detection algorithm depends both on the quantity of annotated data and on the quality of the annotations. Obviously, the training dataset should contain multiple diverse annotated samples so that the model can generalize properly and perform well in real-life scenarios. In fact, it is often claimed that the more data is used for training, the better the resulting model. In this article we want to discuss the issue in the context of annotation quality.

Data preparation and annotation

Proper preparation of the dataset is often the most costly and time-consuming stage of object detection model development. To be used for training, the images first need to be acquired, then filtered according to the requirements, and finally annotated with adequate labels.
All three of these steps can turn out to be a challenge. In many real-life applications where custom models are required, the dataset needs to be collected from existing infrastructure, such as CCTV cameras.

Reviewing the camera views, angles, distances, and lighting conditions is necessary to select the proper data sources. The image acquisition process itself requires attention to data availability, storage, and preprocessing, such as cutting video footage into frames and filtering those appropriate for annotation. A minimal sketch of such a preprocessing step is shown below.
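To make this step concrete, here is a minimal sketch of frame extraction with a simple blur filter. It assumes OpenCV (cv2) is installed; the file paths, sampling interval, and blur threshold are illustrative placeholders, not values from this article.

```python
# A minimal sketch of frame extraction for annotation, assuming OpenCV.
# Paths, sampling interval, and blur threshold are illustrative only.
import cv2
from pathlib import Path

VIDEO_PATH = "cctv_footage.mp4"   # hypothetical input video
OUTPUT_DIR = Path("frames")       # hypothetical output directory
FRAME_STEP = 30                   # keep every 30th frame (~1 fps at 30 fps)

OUTPUT_DIR.mkdir(exist_ok=True)
cap = cv2.VideoCapture(VIDEO_PATH)

index = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:                    # end of video (or read error)
        break
    if index % FRAME_STEP == 0:
        # A simple blur check: discard frames whose Laplacian variance
        # is low, a common heuristic for out-of-focus images.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if cv2.Laplacian(gray, cv2.CV_64F).var() > 100.0:
            cv2.imwrite(str(OUTPUT_DIR / f"frame_{index:06d}.jpg"), frame)
            saved += 1
    index += 1

cap.release()
print(f"Saved {saved} of {index} frames for annotation")
```

In practice the filtering criteria (blur, duplicates, occlusions) depend on the annotation guidelines, and the kept frames still need manual screening.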

The last step is image annotation. One needs to decide on annotation guidelines, which determine the overall complexity of the task. Detailed and precise guideline documentation prevents annotation inconsistency. On the other hand, a multitude of annotation rules makes the annotation process difficult and time consuming. This leads to a natural trade-off between the number of images that can be annotated in a given time and the level of annotation accuracy and detail. Popular cost-effective methods of creating large-scale datasets, such as crowdsourcing, may result in noisy, erroneous annotations. The most common errors in image annotation include the following (a simple automated sanity check is sketched after the list):

  • missing bounding boxes of objects that should be annotated
  • redundant bounding boxes of objects that should not be annotated
  • inaccurate bounding boxes, both in size and/or position
  • incorrect class labels in multiclass annotation tasks
  • incorrect attribute assignments, e.g. a sitting person labeled as standing.
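
Some of these errors can be caught automatically before training. The sketch below is a plain-Python sanity check under assumed conventions: boxes stored as (x_min, y_min, x_max, y_max, class_label) tuples, with an illustrative image size, class list, and duplicate threshold.

```python
# A minimal sanity-check sketch for bounding-box annotations. The box
# format, image size, and thresholds are illustrative assumptions.
from itertools import combinations

IMG_W, IMG_H = 1920, 1080        # hypothetical image resolution
VALID_CLASSES = {"person", "car", "bicycle"}

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def check_annotations(boxes):
    issues = []
    for i, (x1, y1, x2, y2, label) in enumerate(boxes):
        if x2 <= x1 or y2 <= y1:
            issues.append(f"box {i}: degenerate size")
        if x1 < 0 or y1 < 0 or x2 > IMG_W or y2 > IMG_H:
            issues.append(f"box {i}: outside image bounds")
        if label not in VALID_CLASSES:
            issues.append(f"box {i}: unknown class '{label}'")
    # Near-duplicate boxes often indicate redundant annotations.
    for i, j in combinations(range(len(boxes)), 2):
        if iou(boxes[i][:4], boxes[j][:4]) > 0.9:
            issues.append(f"boxes {i} and {j}: near-duplicates (IoU > 0.9)")
    return issues

print(check_annotations([(10, 10, 200, 300, "person"),
                         (12, 11, 198, 302, "person"),   # near-duplicate
                         (50, 50, 40, 120, "dog")]))     # degenerate, bad class
```

Such checks cannot detect missing boxes or judge attribute correctness, but they cheaply remove the obviously malformed annotations before they reach the training set.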
Models trained on noisy datasets

When machine learning (ML) models are trained on noisy datasets, several issues can arise, including:

  • Reduced accuracy: Noisy data can cause the ML model to make incorrect predictions, reducing the overall accuracy of the model.
  • Overfitting: ML models trained on noisy data are more likely to overfit, meaning they become too closely tailored to the noise in the dataset, rather than the underlying patterns or relationships. Overfitting can lead to poor generalization to new data and a lack of robustness.
  • Unreliable insights (detections): If an ML model is trained on noisy data, the resulting insights or predictions may be unreliable, making it difficult to make informed decisions based on the output.
  • Increased computational costs: ML models trained on noisy data may require more computational resources and time to train, as the noise can make it more difficult for the model to learn the underlying patterns in the data.

So how does this relate to data annotation, and what are the basic issues there?
The effect of annotation noise in the training dataset on the model’s performance has been widely discussed and studied by ML experts. A common theory states that errors in labeling may be neutralized by adding more training data. Indeed, the statement has been supported by several research papers and experiments, under the assumption that the annotation noise in the dataset is random.

The randomness of annotation errors can be understood as equiprobable occurrences of errors of various types. However, in many real-life scenarios this assumption does not hold. Labeling mistakes made by human annotators may be biased by their level of expertise, individual perception, or the tools used in the process. Consider, as an example, the incorrect class labels issue: if labels are assigned by choosing an item from a predefined ordered list, a misclick results in the mutual replacement of class labels that are neighbors on the list, while the likelihood of swapping the very first and last items in the list is negligible. Incorrect attributes might result from individual perception of, e.g., colors. Inaccuracy of bounding box size and position might be an effect of insufficient training or imprecise annotation guidelines. In such situations the annotation errors are systematic rather than random, and the bias observed in the training dataset is transferred to the model, as the toy simulation below illustrates.
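
The following toy simulation (our own illustration, not an experiment from a cited paper) contrasts random and systematic label noise using scikit-learn on synthetic blob data. The systematic variant mimics the misclick example above: a corrupted label moves to a neighboring class index, so errors on the first and last classes are directionally biased. The class count, noise rate, and sample sizes are arbitrary assumptions.

```python
# Toy comparison of random vs. systematic (neighbor-on-the-list) label
# noise. All parameters are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
N_CLASSES, NOISE_RATE = 5, 0.3

def corrupt(y, systematic):
    y = y.copy()
    hit = rng.random(len(y)) < NOISE_RATE
    for i in np.where(hit)[0]:
        if systematic:                   # misclick: neighbor on the list
            y[i] = (min(y[i] + 1, N_CLASSES - 1) if rng.random() < 0.5
                    else max(y[i] - 1, 0))
        else:                            # random: any other class
            y[i] = rng.choice([c for c in range(N_CLASSES) if c != y[i]])
    return y

X, y = make_blobs(n_samples=20000, centers=N_CLASSES, cluster_std=3.0,
                  random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

for n in (500, 2000, 10000):             # growing training-set sizes
    for systematic in (False, True):
        y_noisy = corrupt(y_train[:n], systematic)
        model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_noisy)
        acc = model.score(X_test, y_test)  # evaluated on *clean* labels
        kind = "systematic" if systematic else "random"
        print(f"n={n:5d}  {kind:10s} noise  clean-test accuracy={acc:.3f}")
```

With uniform random noise, the correct class typically remains the majority at every point, so clean-test accuracy tends to recover as the training set grows. With the biased neighbor flips, the boundaries around the edge classes stay shifted; this is exactly the kind of systematic bias that more data cannot wash out.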


How to avoid noise in annotations?

Since data annotation is a manual operation, there is no perfect way to avoid such problems. As we said at the beginning, the sheer amount of data will not solve this problem: even large datasets contaminated with bad annotations can have significant consequences in the form of poor-quality models, and this translates into real costs, such as extended implementation time, disappointed customers, and infrastructure costs.

To reduce this type of risk to a minimum, it is necessary to introduce a well-supervised data labeling process, in which the expertise of specialists is crucial for early detection of irregularities. The most important stage here is an iterative loop of reviewing the data and testing the trained models, so that time and resources are not over-invested before the desired data quality is achieved. One simple building block of such a review is measuring agreement between independent annotators, as sketched below.
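The sketch below is one possible, assumed form of such a check, not a tool prescribed by the process described above: two annotators label the same image, and any box whose best IoU match against the other annotator’s boxes falls below a threshold is flagged for review. Box format and threshold are illustrative.

```python
# A minimal inter-annotator agreement check (illustrative assumptions:
# boxes as (x1, y1, x2, y2) tuples, IoU threshold of 0.5).
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def flag_disagreements(boxes_a, boxes_b, threshold=0.5):
    """Return annotator A's boxes with no good match in annotator B's set."""
    flagged = []
    for box in boxes_a:
        best = max((iou(box, other) for other in boxes_b), default=0.0)
        if best < threshold:   # missing, misplaced, or redundant box
            flagged.append((box, best))
    return flagged

annotator_a = [(100, 100, 300, 400), (500, 120, 640, 360)]
annotator_b = [(105, 95, 295, 410)]   # second box is missing entirely
print(flag_disagreements(annotator_a, annotator_b))
```

Flagged boxes can then be escalated to a senior reviewer, keeping expensive expert time focused on genuine disagreements rather than on re-checking every annotation.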

If you do not feel confident enough to organize this type of team and process yourself, do not hesitate to ask for help from companies such as ours. With a good specialized team your chances increase, and there is no need to learn from your own mistakes: thanks to the knowledge transferred, in a relatively short time you will gain the expertise necessary to continue this type of work on your own.