Publication
Named Entities Recognition for Machine Translation: A Case Study on the Importance of Named Entities for Customer Support
| Abstract: | The last two decades have seen significant change in the international panorama at all levels. The onset of the internet and content availability has propelled us into a new era: the Information Age. The staggering growth of new digital content, whether in the form of ebooks, on-demand TV shows, blogs, or e-commerce websites, has increased the need for translated material, driven by people's demand for quick access to this shared knowledge in their native languages and dialects. Fortunately, machine translation (MT) technologies, which in many cases provide human-like translations, are now more widely available, enabling quicker translations for multiple languages at more affordable prices. This work describes the Natural Language Processing (NLP) sub-task known as Named Entity Recognition (NER), as performed by Unbabel, a Portuguese machine translation start-up that combines MT with human post-editing and focuses strictly on customer service content, to improve the quality of translation outputs. The main objective of this study is to contribute to furthering MT quality and good practices by exposing the importance of having a robust, continuously developed Named Entity Recognition system for generic and client-specific content in an MT pipeline and for General Data Protection Regulation (GDPR) compliance; moreover, with future applications in mind, we have tested strategies that support the creation of multilingual Named Entity Recognition systems. In this work, we first define the meaning of Named Entity, highlighting its importance in a machine translation scenario, followed by a brief historical overview of the subject. We also provide a description of the most recent data-driven machine translation technologies. Concerning the main topic of this work, we describe three experiments carried out jointly with Unbabel's NLP team. The first experiment focuses on assisting the NLP team in the creation of a domain-specific Named Entity Recognition (NER) system. The second and third experiments explore the possibility of creating multilingual NER gold standards in a semi-automatic fashion, by resorting to aligners able to project Named Entities across a parallel corpus. |
|---|---|
| Main authors: | Menezes, Luís Miguel Correia |
| Subject: | Unbabel (Lisbon, Portugal); Machine translation; Natural language processing; Named entity recognition; Translation; Master's theses - 2021 |
| Year: | 2021 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade de Lisboa |
| Language: | English |
| Source: | Repositório da Universidade de Lisboa |