HADA

Research on advanced data annotation tools

Challenge

High-quality data is the cornerstone of AI. However, generating it requires substantial human intervention, which leads to inconsistency, human error, and high costs in time and money. It is also difficult to source tools that deliver data that is high quality, regulatory compliant, securely generated, scalable and fast to produce, affordable, flexible, consistent, accurately annotated, and representative of its domain in a balanced way.

The objective of the HADA project is to design a set of data annotation tools for the most widely used data types in AI: voice, text and image. This will give Sigma an advanced annotation tool framework that expands and speeds up its data annotation services and paves the way for the commercialization of annotation tools.

The project will research and address the following stages of the machine learning lifecycle:

  1. Data preparation and selection: Select the data considered most relevant to improving AI models and prepare it so that manual annotation is easier.
  2. Annotation: Develop technologies that simplify the annotators' work to speed up the process and improve quality.
  3. Quality control: Establish solutions that help detect and correct errors while increasing consistency between different annotators.
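
As an illustration of the quality control stage, consistency between two annotators is commonly measured with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal example under that assumption; the function name `cohen_kappa` and the sample labels are hypothetical and are not taken from the HADA tools.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Fraction of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random with their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators over the same six items.
annotator_1 = ["cat", "dog", "cat", "cat", "dog", "dog"]
annotator_2 = ["cat", "dog", "dog", "cat", "dog", "cat"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A value near 1 indicates strong agreement, while a value near 0 indicates agreement no better than chance, which can be used to flag guidelines or items that need review.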

 

The resulting solution will be a scientific annotation framework for Human-in-the-Loop Artificial Intelligence (HITL AI).

Solution

HADA, or Investigación de Herramientas de Anotación de Datos Avanzadas (Research on Advanced Data Annotation Tools), is an individual industrial research project that supports Sigma's continued offering of data annotation services. The proposed solution will optimize annotation tasks to obtain quality data faster and more accurately than existing tools. The HADA tools are designed to work in combination with most existing annotation tools.
These advanced tools will:
  • Reduce the time required for human data annotation through new algorithms that automate repetitive tasks and redirect human effort to higher value-added tasks.
  • Assess and select the data to be annotated based on how much it is expected to improve model performance and reduce bias.
  • Establish mechanisms to ensure the quality of the data.
  • Ensure compliance with data protection regulations.

The tools under development support the entire data annotation process and include:

Active Learning: Research and implementation of hybrid unsupervised and semi-supervised models to reduce the need for large labeled data sets.
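
One widely used active learning strategy is uncertainty sampling: the samples on which the current model is least certain are sent to annotators first. The sketch below illustrates that general idea only; it is not the hybrid approach researched in HADA, and the names `select_for_annotation`, `predictions` and `budget` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` samples with the most uncertain predictions."""
    ranked = sorted(predictions, key=lambda sid: entropy(predictions[sid]), reverse=True)
    return ranked[:budget]


# Hypothetical model outputs: sample id -> predicted class probabilities.
predictions = {
    "s1": [0.98, 0.01, 0.01],
    "s2": [0.40, 0.35, 0.25],
    "s3": [0.70, 0.20, 0.10],
    "s4": [0.34, 0.33, 0.33],
}
to_annotate = select_for_annotation(predictions, budget=2)
print(to_annotate)
```

Confidently predicted samples such as `s1` are skipped, so the annotation budget goes to the samples most likely to change the model.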

Data Anonymization: Application of anonymization within the algorithms used for data selection, annotation support and quality control. Automatic removal of distractors through AI modeling and data enhancement.

Decision Reduction: AI model to assist in the intelligent reduction of labelling options provided to the annotator, tending to binary classification problems.
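
A simple way to picture decision reduction is to show the annotator only the few labels a model considers most likely, falling back to the full list when those labels do not cover enough probability mass. The following is a minimal sketch under that assumption; `shortlist_labels`, `max_options` and `min_coverage` are hypothetical names, and the scores are invented for illustration.

```python
def shortlist_labels(scores, max_options=2, min_coverage=0.9):
    """Keep the top-scoring labels and report whether the full list is still needed."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    shortlist = [label for label, _ in ranked[:max_options]]
    covered = sum(score for _, score in ranked[:max_options])
    # If the shortlist covers too little probability, the annotator should see every option.
    needs_full_list = covered < min_coverage
    return shortlist, needs_full_list


# Hypothetical model scores for one image.
scores = {"car": 0.62, "truck": 0.31, "bus": 0.04, "bicycle": 0.03}
options, needs_full_list = shortlist_labels(scores)
print(options, needs_full_list)
```

With `max_options=2` the annotator faces a binary choice whenever the model is confident enough, which is the direction the description above points to.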

Multiple Annotation: Intelligent data clustering algorithms that allow several similar samples to be annotated in a single action.
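
To illustrate the idea, samples can be grouped by the similarity of their feature vectors so that an annotator labels a whole group at once. The sketch below uses a simple greedy grouping by cosine similarity as a stand-in for a real clustering algorithm; the function names, the threshold and the embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_similar(embeddings, threshold=0.95):
    """Greedily assign each sample to the first group whose representative is similar enough."""
    groups = []  # list of (representative_vector, [sample_ids])
    for sample_id, vector in embeddings.items():
        for representative, members in groups:
            if cosine(vector, representative) >= threshold:
                members.append(sample_id)
                break
        else:
            # No existing group is close enough, so this sample starts a new one.
            groups.append((vector, [sample_id]))
    return [members for _, members in groups]


# Hypothetical 2-D embeddings of four images.
embeddings = {
    "img1": [1.0, 0.0],
    "img2": [0.99, 0.05],
    "img3": [0.0, 1.0],
    "img4": [0.05, 0.98],
}
batches = group_similar(embeddings)
print(batches)
```

Each resulting batch can be presented to the annotator together, who confirms one label for the group and corrects only the exceptions.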

Automatic Error Detection: Automatic annotation error detection using unsupervised learning techniques.
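
One possible reading of this approach is to cluster the samples without using their labels and then flag annotations that disagree with the dominant label of their cluster. The sketch below assumes cluster assignments are already available from some unsupervised algorithm; `flag_suspect_labels`, `min_majority` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def flag_suspect_labels(cluster_ids, labels, min_majority=0.75):
    """Flag samples whose label differs from a clear majority label in their cluster."""
    by_cluster = defaultdict(list)
    for sample_id, cluster in cluster_ids.items():
        by_cluster[cluster].append(sample_id)

    suspects = []
    for members in by_cluster.values():
        counts = Counter(labels[sample_id] for sample_id in members)
        majority_label, majority_count = counts.most_common(1)[0]
        if majority_count / len(members) < min_majority:
            continue  # cluster is too mixed to trust its majority label
        suspects.extend(s for s in members if labels[s] != majority_label)
    return suspects


# Hypothetical cluster assignments and human labels for six audio clips.
cluster_ids = {"u1": 0, "u2": 0, "u3": 0, "u4": 0, "u5": 1, "u6": 1}
labels = {"u1": "speech", "u2": "speech", "u3": "music", "u4": "speech",
          "u5": "noise", "u6": "music"}
suspects = flag_suspect_labels(cluster_ids, labels)
print(suspects)
```

Flagged samples are candidates for human review rather than confirmed errors, since a minority label inside a cluster can also be correct.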

Results

The project started at the end of 2022 and is expected to be completed by mid-2024.

 

Funded by

Project 2021/C005/00146323 is funded by the European Union's NextGenerationEU through the public business entity attached to the Ministry of Economic Affairs and Transformation.

 

Partners

The project will be developed entirely by Sigma Cognition, with the support of two specialized groups from the Polytechnic University of Madrid (UPM) and Universidad Carlos III. 

Project News and Events

  • June 14, 2024: Special session on “Research on advanced AI-based data annotation tools” as part of the International Conference on Artificial Intelligence Applications & Innovations.
  • June 26-28, 2024: DCAI Salamanca. Join us for a special session on “Advanced AI-based Data Annotation Tools,” as part of the 21st International Conference on Distributed Computing and Artificial Intelligence 2024.

Publications

  • Anotación de datos para analítica de conversaciones [Data annotation for conversation analytics] (White Paper)
  • Preparación de datos para proyectos de Visión por Computador [Data preparation for Computer Vision projects] (White Paper)
  • TRIPTICO_HADA_VF.pdf
  • Llerena, J. P., Patricio, M. A., Molina, J. M., Mora-Sánchez, A. & Rodríguez-Jiménez, S. (2024). Innovative Quality Metrics for Enhanced Interpretation of Instance Segmentation in Complex Image Scenarios. Distributed Computing and Artificial Intelligence, Special Session on Advanced AI-based Data Annotation Tools (AI-DAT), 21st International Conference. DCAI 2024.
  • Gutiérrez-Navarro, J., Mora-Sánchez, A., Rodríguez-Jiménez, S. & Blanco-Murillo, J. L. (2024). AI-Boosted Video Annotation: Exploring Pre-Labeling with Cross-Modalities. Distributed Computing and Artificial Intelligence, Special Session on Advanced AI-based Data Annotation Tools (AI-DAT), 21st International Conference. DCAI 2024.
  • Fernández-Castañón, R., Espinoza-Cuadros, F. M., Perero-Codosero, J. M., Sancho-Lozano, E. & Hernández-Gómez, L. A. (2024). Can Large Sound Event Detection models be accurately adapted to specific acoustic scenarios? Distributed Computing and Artificial Intelligence, Special Session on Advanced AI-based Data Annotation Tools (AI-DAT), 21st International Conference. DCAI 2024.
  • Espinoza-Cuadros, F. M., Ginard-Aguilera, R. & Perero-Codosero, J. M. (2024). How Does Speech Quality Impact the Data Transcription Process? Distributed Computing and Artificial Intelligence, Special Session on Advanced AI-based Data Annotation Tools (AI-DAT), 21st International Conference. DCAI 2024.
  • Cortón-González, J., Mora-Sánchez, A. & Rodríguez-Jiménez, S. (2024). Enhancing Image Annotation Through Attention Mining: A Grounded SAM Approach. Distributed Computing and Artificial Intelligence, Special Session on Advanced AI-based Data Annotation Tools (AI-DAT), 21st International Conference. DCAI 2024.