Publication

Cost-Sensitive Machine Learning for Peer-to-Peer Credit Scoring: Balancing Predictive Accuracy, Risk and Regulatory Compliance

Bibliographic details
Abstract:	Credit scoring tools are frequently used to assess a client's capacity to repay a loan, and peer-to-peer (P2P) lending is no exception. A key challenge in this setting is class imbalance, as reliable (non-defaulting) clients typically outnumber those who default. In this study, we propose a refined approach that addresses not only class imbalance but also the challenges of feature selection and interpretability that emerge when employing more complex, yet more accurate, models. We explore several machine learning algorithms for predicting loan default risk, including Logistic Regression, Random Forest, XGBoost, and LightGBM. To tackle the imbalance problem, we apply a cost-sensitive learning strategy and compare it with a traditional data sampling method. For interpretability, we employ SHapley Additive exPlanations (SHAP) to assess both global and instance-level feature importance. Our experiments are conducted on a widely used real-world dataset from Lending Club. Performance is evaluated using both class-specific metrics and global metrics such as the Area Under the Curve (AUC).
Main authors:	Ramos, Sophia Mizinski
Subject:	Cost-sensitive Learning; Credit Scoring; Feature Selection; Risk; Imbalanced Dataset; Mutual Information; Machine Learning; SDG 1 - No poverty; SDG 8 - Decent work and economic growth; SDG 9 - Industry, innovation and infrastructure; SDG 10 - Reduced inequalities; SDG 16 - Peace, justice and strong institutions
Year:	2025
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Universidade Nova de Lisboa
Language:	English
Source:	Repositório Institucional da UNL
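
The abstract names three ingredients without showing them: class weighting for imbalance, a cost-sensitive decision rule, and AUC as a global metric. The following is a minimal, standard-library sketch of those ideas only. It is not the dissertation's implementation: the "balanced" weighting heuristic, the misclassification costs (`cost_fp`, `cost_fn`), the function names, and the toy labels and probabilities are all assumptions made for illustration.

```python
from collections import Counter
import math


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumed heuristic: rarer classes (defaults) receive larger weights.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_log_loss(labels, probs, weights):
    """Binary cross-entropy with each sample scaled by its class weight."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


def cost_threshold(cost_fp, cost_fn):
    """Default probability above which flagging a loan lowers expected cost.

    Flagging costs (1 - p) * cost_fp in expectation, not flagging costs
    p * cost_fn, so flagging wins when p > cost_fp / (cost_fp + cost_fn).
    Correct decisions are assumed to cost nothing.
    """
    return cost_fp / (cost_fp + cost_fn)


def auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = default (minority class), 0 = repaid.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.35, 0.70]

weights = balanced_class_weights(labels)
loss = weighted_log_loss(labels, probs, weights)
# Assumed costs: a missed default is four times as costly as a false alarm.
threshold = cost_threshold(cost_fp=1.0, cost_fn=4.0)
flagged = [int(p > threshold) for p in probs]
score = auc(labels, probs)
```

With these assumed costs the decision threshold falls well below 0.5, so more loans are flagged as risky than under a default cutoff. AUC is unaffected by the threshold because it depends only on how the scores rank defaults against non-defaults.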
