Author(s): Moreno, María Fernanda Osorio
Date: 2018
Persistent ID: http://hdl.handle.net/10362/33863
Origin: Repositório Institucional da UNL
Subject(s): Imbalanced datasets; Fraud; oversampling; Insurance
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
Although data is now produced in enormous volumes every second, there are situations where the target category is represented extremely unequally, giving rise to imbalanced datasets. Analyzing them correctly can lead to relevant decisions that produce appropriate business strategies. Fraud modeling is one example of this situation: fraudulent transactions are expected to be fewer than legitimate ones, and predicting them could be crucial for improving decisions and processes in a company. However, class imbalance has a negative effect on the traditional techniques used for this problem. Many techniques have been proposed to address it, and oversampling is one of them. This work analyzes the behavior of different oversampling techniques, such as random oversampling, SOMO and SMOTE, across different classifiers and evaluation metrics. The exercise is done with real data from an insurance company in Colombia, predicting fraudulent claims for its compulsory auto product. The conclusions of this research demonstrate the advantages of using oversampling under class imbalance, but also the importance of comparing different evaluation metrics and classifiers to obtain accurate conclusions and comparable results.
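As a rough illustration of the simplest technique named above, random oversampling can be sketched as duplicating randomly chosen minority-class rows until every class matches the majority count. The sketch below is not the dissertation's implementation: the function name `random_oversample`, the toy data, and the 95/5 class split are all hypothetical, chosen only to show the idea and to show why accuracy alone can mislead on imbalanced data.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class
    until every class has as many rows as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Candidate rows to duplicate for this class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 95 legitimate claims (0) and 5 fraudulent ones (1)
rows = [[i] for i in range(100)]
labels = [0] * 95 + [1] * 5

bal_rows, bal_labels = random_oversample(rows, labels)
balanced_counts = Counter(bal_labels)  # both classes now have 95 rows

# A model that always predicts "legitimate" scores high accuracy on the
# original data while detecting no fraud at all (recall of zero).
always_legit = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(always_legit, labels)) / len(labels)
recall = sum(p == 1 and y == 1 for p, y in zip(always_legit, labels)) / labels.count(1)
```

In practice, oversampling would normally be applied only to the training split, so that duplicated rows do not leak into the data used for evaluation.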