Document details

Automatic Multilingual Recognition of Named Entities in Various Domains, for the Purposes of Machine Translation Anonymization

Author(s): Menezes, Miguel ; Cabarrão, Vera ; Moniz, Helena ; Mota, Pedro

Date: 2022

Origin: Revista da Associação Portuguesa de Linguística

Subject(s): Machine Translation; Named Entities; Annotation; Alignment Systems; Gold Standards; Aligners

Description

This article describes research carried out at Unbabel, a Portuguese Machine Translation start-up that combines Machine Translation (MT) with human post-editing, with a focus on customer service content. Drawing on work done in a real multilingual, AI-powered, human-refined MT industry setting, we aim to contribute to MT quality and good practices by showing the importance of robust, continuously developed Named Entity Recognition (NER) systems for General Data Protection Regulation (GDPR) compliance. We report three experiments, the result of a year of joint work with Unbabel's linguists and Unbabel's Artificial Intelligence (AI) engineering team. The first experiment focused on developing a methodology for identifying and annotating domain-specific Named Entities (NEs) for the food industry. The methodology supports the construction of gold standards for building domain-specific NER systems and can be applied to many other domains. Applying it, we identified the following set of domain-specific NEs: Restaurant Names; Restaurant Chains; Dishes; Beverage; Ingredients. The second and third experiments explored the semi-automatic construction of multilingual NER gold standards for different domains and language pairs, using aligners that project Named Entities across a parallel corpus. Both experiments made it possible to benchmark four open-source aligners (SimAlign; Fastalign; AwesomeAlign; Eflomal), identify the best-performing one and, at the same time, validate this approach. This work should be read as a statement of multidisciplinarity, demonstrating the much-needed articulation between the different scientific fields that make up Natural Language Processing (NLP).
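The projection step mentioned in the description can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the aligner's output is available as space-separated "i-j" pairs (source token index, target token index), a format commonly produced by word aligners, and the function names, span representation, and example sentence are all hypothetical.

```python
from typing import Dict, List, Tuple

# A span is (first token index, last token index inclusive, entity label).
Span = Tuple[int, int, str]


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse space-separated 'i-j' pairs into (source, target) index tuples."""
    pairs = []
    for pair in line.split():
        src, tgt = pair.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def project_spans(spans: List[Span], alignment: List[Tuple[int, int]]) -> List[Span]:
    """Project source-side entity spans onto the target sentence.

    Each projected span covers the smallest contiguous target range that
    contains every target token aligned to the source span. Spans with no
    aligned target token are dropped (an assumption; a real pipeline might
    flag them for manual review instead).
    """
    src_to_tgt: Dict[int, List[int]] = {}
    for src, tgt in alignment:
        src_to_tgt.setdefault(src, []).append(tgt)

    projected: List[Span] = []
    for start, end, label in spans:
        targets = [t for s in range(start, end + 1) for t in src_to_tgt.get(s, [])]
        if targets:
            projected.append((min(targets), max(targets), label))
    return projected


# Hypothetical example:
#   source tokens: I(0) ordered(1) pizza(2) at(3) Luigi's(4)
#   target tokens: Pedi(0) pizza(1) no(2) Luigi's(3)
source_spans = [(2, 2, "Dish"), (4, 4, "Restaurant Name")]
alignment = parse_alignment("0-0 1-0 2-1 3-2 4-3")
target_spans = project_spans(source_spans, alignment)
```

In a setup like this, the projected target-side spans would still need human review before being used as a gold standard, which is consistent with the semi-automatic framing of the second and third experiments.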

Document Type: Journal article
Language: Portuguese
