Publication
Cost-Sensitive Machine Learning for Peer-to-Peer Credit Scoring: Balancing Predictive Accuracy, Risk and Regulatory Compliance
| Abstract: | Credit scoring tools are frequently used to assess a client's capacity to repay a loan, and this instrument is no exception in the context of peer-to-peer (P2P) lending. A key challenge in this context is the issue of imbalanced datasets, as reliable (non-defaulting) clients typically outnumber those who default. In this study, we propose a refined approach that addresses not only class imbalance but also the challenges of feature selection and interpretability that emerge when employing more complex, yet more accurate, models. We explore several machine learning algorithms for predicting loan default risk, including Logistic Regression, Random Forest, XGBoost, and LightGBM. To tackle the imbalance problem, we apply a cost-sensitive learning strategy and compare it with a traditional data sampling method. For interpretability, we employ SHapley Additive exPlanations (SHAP) to assess both global and instance-level feature importance. Our experiments are conducted on a widely used real-world dataset from the Lending Club. Performance is evaluated using both class-specific metrics and global metrics such as the Area Under the Curve (AUC). |
|---|---|
| Main authors: | Ramos, Sophia Mizinski |
| Subject: | Cost-sensitive Learning; Credit Scoring; Feature Selection; Risk; Imbalanced Dataset; Mutual Information; Machine Learning; SDG 1 - No poverty; SDG 8 - Decent work and economic growth; SDG 9 - Industry, innovation and infrastructure; SDG 10 - Reduced inequalities; SDG 16 - Peace, justice and strong institutions |
| Year: | 2025 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Source: | Repositório Institucional da UNL |