Publication

Rich Large-Scale Portuguese Language Models from Large Portuguese Corpora


Bibliographic details
Abstract: Language is one of the most fundamental characteristics of human behavior. It enables us to express ourselves and communicate, and it has helped shape our thoughts and our knowledge of the world since the dawn of humankind. Portuguese is the sixth most spoken language in the world, with over 250 million speakers worldwide. Of those, over 40 million are potential European Portuguese (PT-PT) speakers. Despite this widespread use, the development of natural language processing (NLP) tools for PT-PT has lagged behind that of other languages, such as English or French. This is partly due to the lack of large-scale annotated datasets and of computational resources dedicated to this variant of Portuguese. The NLP field has improved greatly in recent years, leading to innovative tools for language analysis and processing powered by neural language models. These advances come from training deep, transformer-based large language models that can perform tasks such as machine translation, sentiment analysis, summarization, and even simple reasoning. This thesis aims to address these problems and contribute to the development of PT-PT NLP tools by presenting our own generative model, GlórIA. It can model the intricacies of the Portuguese language and is proficient at multiple natural language tasks. We present training techniques and protocols, followed by an evaluation and comparison against other recent Portuguese models on several downstream tasks. To train the model, a PT-PT corpus composed of different data sources was built; it is presented in this work to help address the lack of publicly available datasets for this language variant. In parallel, a new, small benchmark was produced to evaluate a model's generative performance on a language modeling task.
Main authors: Lopes, Ricardo Valverde
Subject: Large Language Models Transformers Portuguese Natural Language Processing Datasets
Year: 2023
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade Nova de Lisboa
Language: English
Source: Repositório Institucional da UNL