Publication

Supervised clustering with SHAP values

Bibliographic details
Abstract:	In recent years, data has grown rapidly, not only in size but also in complexity. As companies shift to data-driven environments, this complexity hinders the analysis of data and the extraction of value from it. As a result, traditional methods are becoming obsolete as their performance declines, and machine learning and deep learning models are becoming more complex in order to reach the desired accuracy. This work proposes an approach that recognizes complex relationships and identifies groups that are not visible at first glance, while keeping the methods used fully interpretable. It combines a black-box model with SHAP values to generate previously unknown clusters from the model's explanations. The clusters obtained combine the multiple local explanations that SHAP values offer and are easily interpretable, since each feature value in the clustered data corresponds to the importance the model assigned to that feature. To implement this approach, a dataset containing the properties of benign and malware samples, designed for malware detection tasks, was used. It is shown that by combining SHAP values with XGBoost it is possible to generate new clusters that were previously hidden and unobtainable with traditional approaches. These clusters are highly interpretable, as they derive from SHAP values and are supported by a supervised setting.
Main authors:	Conceição, Rodrigo Queirós
Subject:	SHAP values; Black-box models; Interpretability; Supervised clustering; XGBoost
Year:	2023
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Universidade de Lisboa
Language:	English
Source:	Repositório da Universidade de Lisboa
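
The pipeline the abstract describes (train a supervised model, compute SHAP values, cluster the explanations) can be illustrated with a minimal sketch. This is not the dissertation's implementation. To stay dependency-free it makes several loudly stated substitutions: a small logistic regression stands in for XGBoost, the closed-form SHAP values of a linear model (valid under a feature-independence assumption) stand in for the SHAP library, and the data is synthetic rather than the malware dataset. All function and variable names are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent logistic regression (assumed stand-in for XGBoost)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def linear_shap(X, w):
    """SHAP values of a linear model's margin: phi_j = w_j * (x_j - mean_j).

    Exact only under the assumption that features are independent.
    """
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    phi = [[w[j] * (row[j] - means[j]) for j in range(d)] for row in X]
    return phi, means

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on lists of floats; an empty cluster keeps its old centroid."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Synthetic data (an assumption for illustration): two informative features, one noise.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [1 if row[0] + 0.5 * row[1] > 0 else 0 for row in X]

# 1) Supervised step: fit the model on the labels.
w, b = train_logistic(X, y)

# 2) Explanation step: one SHAP vector per sample, on the model's margin (log-odds).
shap_values, means = linear_shap(X, w)
base_value = sum(wj * mj for wj, mj in zip(w, means)) + b

# 3) Clustering step: group samples by their explanations, not their raw features.
labels, centroids = kmeans(shap_values, k=3)

for c, centre in enumerate(centroids):
    # Each centroid coordinate is an average feature contribution within the cluster.
    print(c, labels.count(c), [round(v, 3) for v in centre])
```

Because clustering happens in SHAP space, each centroid reads directly as an average per-feature contribution to the model's output, which is the source of the interpretability the abstract claims. With the real `shap` and `xgboost` packages, steps 1 and 2 would be replaced by a trained gradient-boosted model and a tree explainer; that substitution is not shown here.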
