A task force oriented toward delivering the best-quality dataset for AI models

When building an AI model, you must choose the right dataset for training and evaluating it. If you are an ML hobbyist, a publicly available dataset will probably suffice. Models built for industrial applications, however, usually require more attention to data preparation and fine-tuning (see our post on the data-centric approach). This concerns both the size of the dataset, which often exceeds tens of thousands of data points, and the choice of features to annotate for the specific use case.

ML Specialist

The ML developer should take part in defining all requirements for the dataset and stay involved throughout the annotation iterations in case additional decisions are needed. Because the effectiveness of the annotation team depends on clear and reliable documentation, this part of the annotation project is crucial for its success. Hence, whenever needed, this preparation step should be assisted by an experienced consultant who knows the annotation process and understands the ML requirements.

Annotator(s)

This is the main task force providing the annotated data. It may be a single person or an army, depending on the timeline and the amount of data. The final quality of the dataset depends on the team's reliability and its understanding of the subject matter being annotated. Therefore, before annotation starts, all task details should be explained at a joint training session with the specialist/consultant.

Project Manager

For a large amount of data, you will need to properly manage the team working on the annotations. The Project Manager divides the project scope into subtasks and assigns them among team members, and makes sure the team is properly trained for the task and understands the requirements. The PM also monitors progress, assigns reviewers to completed tasks and keeps an eye on the timeline. Finally, the PM collects feedback from the annotation team and works with the consultant to clarify the requirements in doubtful cases.
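
As a rough illustration of dividing the scope into subtasks and assigning them, here is a minimal Python sketch. It assumes that items are identified by simple IDs, that subtasks are fixed-size batches, and that round-robin assignment is acceptable. The function names, annotator names and batch size are hypothetical and do not come from any particular annotation tool.

```python
from itertools import cycle


def split_into_batches(item_ids, batch_size):
    """Split a list of item IDs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [item_ids[i:i + batch_size] for i in range(0, len(item_ids), batch_size)]


def assign_batches(batches, annotators):
    """Hand out batches to annotators in round-robin order (an assumed policy)."""
    if not annotators:
        raise ValueError("at least one annotator is required")
    assignment = {name: [] for name in annotators}
    for batch, name in zip(batches, cycle(annotators)):
        assignment[name].append(batch)
    return assignment


# Hypothetical example: 10 items, batches of 3, two annotators.
items = list(range(10))
batches = split_into_batches(items, batch_size=3)
plan = assign_batches(batches, ["annotator_a", "annotator_b"])
```

In practice a PM would probably weight the assignment by each annotator's availability or experience. Round-robin is used here only because it is the simplest policy to show.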

Reviewer

Quality assurance is an important part of dataset preparation. Even the best human annotators tire during laborious tasks and may omit or mislabel data. To ensure quality, a review step should be integrated into the process to catch both sporadic and systematic errors and to send the affected data back for re-annotation. The Reviewer may be assigned from within the annotation team (cross-reviewing) or be a separate QA specialist.
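
One possible way to organize cross-reviewing is sketched below in Python. It assumes a simple rotation in which each annotator's work is reviewed by the next person on the list, and it uses the share of labels changed by the reviewer as a crude signal that a batch may need re-annotation. The function names and the example labels are hypothetical, and the rotation scheme is an illustrative choice, not a prescribed method.

```python
def cross_review_pairs(annotators):
    """Map each annotator to a reviewer: the next person in the list (rotation).

    With at least two annotators, nobody reviews their own work.
    """
    if len(annotators) < 2:
        raise ValueError("cross-reviewing needs at least two annotators")
    n = len(annotators)
    return {name: annotators[(i + 1) % n] for i, name in enumerate(annotators)}


def disagreement_rate(original_labels, reviewed_labels):
    """Fraction of items where the reviewer's label differs from the annotator's."""
    if len(original_labels) != len(reviewed_labels):
        raise ValueError("label lists must have the same length")
    if not original_labels:
        return 0.0
    changed = sum(o != r for o, r in zip(original_labels, reviewed_labels))
    return changed / len(original_labels)


# Hypothetical example: three annotators and one reviewed batch of four items.
pairs = cross_review_pairs(["ann_a", "ann_b", "ann_c"])
rate = disagreement_rate(["cat", "dog", "cat", "dog"], ["cat", "dog", "dog", "dog"])
```

A persistently high disagreement rate for one annotator or one label class would hint at a systematic error, such as a misunderstood guideline, whereas isolated differences are more likely sporadic slips.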