Author(s):
Menezes, Miguel ; Cabarrão, Vera ; Moniz, Helena ; Mota, Pedro
Date: 2022
Origin: Revista da Associação Portuguesa de Linguística
Subject(s): Tradução Automática; Entidades Mencionadas; Anotação; Sistemas de Alinhamento; Machine-Translation; Named Entities; Annotation; Gold Standards; Aligners
Description
The following article describes the research developed at Unbabel, a Portuguese Machine-Translation start-up, that combines Machine Translation (MT) with human post-edition with a focus on customer service content. With the work carried out within a real multilingual AI powered, human-refined, MT industry, we aim to contribute to furthering MT quality and good-practices, by exposing the importance of having continuously in development, robust Named Entity Recognition systems for General Data Protection Regulation (GDPR) compliance. We will report three different experiments, resulting from a shared work with Unbabel´s linguists and Unbabel´s Artificial Intelligence (AI) engineering team, matured over a year. The first experiment focused on developing a methodology for the identification and annotation of domain-specific Named Entities (NEs) for the Food-Industry. The devised methodology allows the construction of gold standards for building domain specific NER systems and can be applied for a myriad of different domains. With the implementation of the designed method, we were able to identify the following domain-specific NEs set: Restaurant Names; Restaurant Chains; Dishes; Beverage, Ingredients. The second and third experiments explored the possibilities of constructing, in a semi-automatically way, multilingual NER gold standards for different domains and language pairs, using aligners that project Named Entities across a parallel corpus. Both experiments made it possible to benchmark four different open-source aligners (SimAlign; Fastalign; AwesomeAlign; Eflomal), allowing to identify the one with better performance and, simultaneously, validate the aforementioned approach. This work should be taken as a statement of multidisciplinary, proving and validating the much-needed articulation between different scientific fields that compose and characterize the area of Natural Language Processing (NLP).