Publication

Text Categorization for Regulatory Compliance

Bibliographic details
Abstract:	This work explores the application of machine learning techniques, specifically Transformer-based models, to the task of legal text classification. The objective was to develop a system capable of classifying legal paragraphs into predefined categories, thus streamlining the legal document review process. This research underscores the critical role of data understanding and preparation, including meticulous preprocessing, the choice of classification granularity, and the formulation of a pertinent label set. A comparative analysis of different Transformer-based models, including BERT, RoBERTa, DistilBERT, and an ensemble model, was conducted. These models were evaluated based on their precision, recall, and F1-score on the classification task, as well as their training time. DistilBERT emerged as the most suitable model due to its balance of strong performance and efficiency. To refine the label set, this work employed a range of text mining tools and approaches to aid legal experts in identifying the main topics within the corpus. Despite the complexities of the legal text and the challenges posed by data imbalance and document format, the research successfully developed an efficient text classification system. This work concludes by discussing potential future directions, primarily related to the advent of large language models (LLMs). The potential of these models for in-context learning and topic proposal was discussed, noting the immense possibilities they bring, despite the substantial computational requirements and ethical considerations. This research contributes valuable insights to the application of machine learning in legal text analysis, and the findings provide a strong foundation for future exploration in this intersection of law and artificial intelligence.
Main authors:	Balata, Duarte Teomóteo
Subject:	Natural Language Processing; Legal Documents; Text Mining; BERT Models; Legal AI; Master's theses - 2024
Year:	2024
Country:	Portugal
Document type:	master's dissertation
Access type:	restricted access
Associated institution:	Universidade de Lisboa
Language:	English
Source:	Repositório da Universidade de Lisboa
