Publication

Decoding the numbers and language behind financial statement fraud

Bibliographic details
Abstract:	Financial statement fraud, together with corruption and asset misappropriation, costs companies over 5 trillion US dollars annually. Timely detection of this offense plays a crucial role in limiting the damage suffered, so automated methods capable of identifying high-probability fraud occurrences are essential. This study evaluates the potential of Large Language Models (LLMs) such as BERT and FinBERT by comparing their performance to that of well-established models such as Logistic Regression and XGBoost. To accomplish this, we analyzed the Management’s Discussion & Analysis (MD&A) section of 1850 10-K reports (1436 non-fraud and 414 fraud), alongside financial ratios and raw accounting variables, from companies known to have manipulated at least one report between 1993 and 2014. Models were trained using three variable types: financial, text, and a combination of both. They were evaluated using three metrics: AUC, NDCG@k, and a threshold-based ‘Capture’, because for this problem probabilities can be more informative than labels. The results suggest that the last part of the MD&A section captures more relevant information than the beginning. Additionally, rank-averaging predictions from models based on the first and last parts of the section did not yield significant improvements, although capture improved. FinBERT outperformed BERT, achieved AUC scores comparable to those of traditional models that leverage OpenAI’s ‘text-embedding-3-large’, and surpassed them in both NDCG@k and capture rates. Thus, FinBERT’s domain-specific pretraining proved particularly advantageous in enhancing fraud detection performance.
Main authors:	Oliveira, João de Brito Brás de
Subject:	Fraud detection; Financial statement; SEC; Deep learning; Machine learning; LLM
Year:	2024
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	ISCTE
Language:	English
Source:	Repositório ISCTE
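
The abstract mentions rank-averaging the predictions of two models and scoring the resulting ranking with NDCG@k. The following is a minimal sketch of both ideas, not the dissertation's implementation. The function names, the tie handling (average ranks), the binary relevance labels, and the log2 discount are assumptions, since the abstract does not define them.

```python
import math


def rank_average(scores_a, scores_b):
    """Average the per-model ranks of two score lists (rank 1 = lowest score).

    Assumption: tied scores share the mean of the ranks they span.
    """
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        result = [0.0] * len(scores)
        i = 0
        while i < len(order):
            j = i
            # Extend j over the run of items tied with item i.
            while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
                j += 1
            shared_rank = (i + j) / 2 + 1
            for pos in range(i, j + 1):
                result[order[pos]] = shared_rank
            i = j + 1
        return result

    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    return [(a + b) / 2 for a, b in zip(ranks_a, ranks_b)]


def ndcg_at_k(labels, scores, k):
    """NDCG@k for binary labels (1 = fraud), ranking by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(labels[i] / math.log2(pos + 2) for pos, i in enumerate(order[:k]))
    ideal = sorted(labels, reverse=True)[:k]
    idcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical fraud probabilities from a "first part" and a "last part" model.
first_part = [0.9, 0.2, 0.4, 0.1]
last_part = [0.6, 0.3, 0.7, 0.2]
combined = rank_average(first_part, last_part)
```

Because only the ordering matters for NDCG@k, the averaged ranks can be passed directly as scores, for example `ndcg_at_k([1, 0, 1, 0], combined, k=2)`.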
