Publication
Text Categorization for Regulatory Compliance
| Abstract: | This work explores the application of machine learning techniques, specifically Transformer-based models, to the task of legal text classification. The objective was to develop a system capable of classifying legal paragraphs into predefined categories, thus streamlining the legal document review process. This research underscores the critical role of data understanding and preparation, including meticulous preprocessing, the choice of classification granularity, and the formulation of a pertinent label set. A comparative analysis of different Transformer-based models, including BERT, RoBERTa, DistilBERT, and an ensemble model, was conducted. These models were evaluated based on their precision, recall, and F1-score on the classification task, as well as their training time. DistilBERT emerged as the most suitable model due to its balance of strong performance and efficiency. To refine the label set, this work employed a range of text mining tools and approaches to aid legal experts in identifying the main topics within the corpus. Despite the complexities of the legal text and the challenges posed by data imbalance and document format, the research successfully developed an efficient text classification system. This work concludes by discussing potential future directions, primarily related to the advent of large language models (LLMs). The potential of these models for in-context learning and topic proposal was discussed, noting the immense possibilities they bring, despite the substantial computational requirements and ethical considerations. This research contributes valuable insights to the application of machine learning in legal text analysis, and the findings provide a strong foundation for future exploration in this intersection of law and artificial intelligence. |
|---|---|
| Main authors: | Balata, Duarte Teomóteo |
| Subject: | Natural Language Processing; Legal Documents; Text Mining; BERT Models; Legal AI; Master's theses - 2024 |
| Year: | 2024 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | restricted access |
| Associated institution: | Universidade de Lisboa |
| Language: | English |
| Source: | Repositório da Universidade de Lisboa |