Publication

Data quality dimensions: assessment and improvement


Bibliographic details
Abstract: Data quality is closely tied to context and is often defined as “fitness for purpose”, meaning it relates directly to the user's objectives. Data quality problems can cause financial losses for organizations, lengthen dataset preprocessing, and lower the accuracy of Machine Learning models. To assess data quality, various dimensions have been introduced; this work focuses on five core dimensions: completeness, uniqueness, accuracy, validity, and consistency. It highlights a gap identified in the scientific literature: the lack of automated tools that require minimal human intervention for data quality assessment and improvement. To address this gap, we present a pipeline of data quality techniques and their corresponding taxonomy, along with the relevant code, implemented in R, to evaluate and improve the five core dimensions. Applying this code to an artificial (synthetic) dataset, we resolved 100% of the data quality issues in all dimensions in just two iterations, without human intervention.
Main authors: António, Francisco de Araújo
Subject: Data Quality; Core Dimensions; Data Quality Problem; Assessment; Improvement; Synthetic Dataset
Year: 2024
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade de Trás-os-Montes e Alto Douro
Language: English
Source: Repositório da UTAD
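Illustrative note: the abstract names five core dimensions but this record does not show how they are measured. The sketch below is one possible way to score them as simple ratios. It is not the dissertation's R pipeline: it is written in Python using only the standard library, and the toy records, the email format rule, the age rule, the reference table, and all function names are hypothetical assumptions made for illustration.

```python
import re

# Hypothetical toy records; None marks a missing value.
records = [
    {"id": 1, "email": "ana@example.com", "birth_year": 1990, "age_in_2024": 34},
    {"id": 2, "email": "not-an-email", "birth_year": 1985, "age_in_2024": 39},
    {"id": 3, "email": None, "birth_year": 2000, "age_in_2024": 30},
    {"id": 1, "email": "ana@example.com", "birth_year": 1990, "age_in_2024": 34},
]

# Hypothetical trusted reference used only for the accuracy check.
reference_emails = {1: "ana@example.com", 2: "bo@example.com", 3: "cy@example.com"}

# Deliberately simple format rule (an assumption, not a full email validator).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows):
    """Share of cells that are not missing."""
    cells = [value for row in rows for value in row.values()]
    return sum(value is not None for value in cells) / len(cells)


def uniqueness(rows):
    """Share of rows that remain after dropping exact duplicates."""
    distinct = {tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows}
    return len(distinct) / len(rows)


def validity(rows, field, pattern):
    """Share of non-missing values in `field` that match the format rule."""
    present = [row[field] for row in rows if row[field] is not None]
    return sum(bool(pattern.match(value)) for value in present) / len(present)


def consistency(rows, rule):
    """Share of rows that satisfy a cross-field rule."""
    return sum(bool(rule(row)) for row in rows) / len(rows)


def accuracy(rows, field, key, reference):
    """Share of rows whose `field` equals the trusted reference value."""
    return sum(row[field] == reference.get(row[key]) for row in rows) / len(rows)


scores = {
    "completeness": completeness(records),
    "uniqueness": uniqueness(records),
    "validity": validity(records, "email", EMAIL_PATTERN),
    "consistency": consistency(
        records, lambda r: r["birth_year"] + r["age_in_2024"] == 2024
    ),
    "accuracy": accuracy(records, "email", "id", reference_emails),
}

for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Each score is a ratio between 0 and 1, so an improvement step (for example, dropping duplicate rows or correcting values that fail a rule) can be checked by re-running the same functions and comparing the scores before and after.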