Publication

Named entity recognition on Simplified Chinese for machine translation

Bibliographic details
Abstract:	This work focuses on Named Entity Recognition (NER), an important task in the field of Machine Translation (MT). The work was conducted at Unbabel, an international artificial intelligence-powered human translation company, and allowed us to evaluate the performance of the NER system when dealing with Named Entities (NEs) in Simplified Chinese. Two experiments were conducted together with Unbabel’s Natural Language Processing (NLP) team to find the best way to develop a NER model for Simplified Chinese. Two methods are proposed: training the model directly on gold standards created from human-annotated data, and training the model on gold standards built using NE projection with a word aligner. Both experiments depend on NE annotation, which was performed by a professional linguist; the annotated data serves as the gold standard for the experiments. The first experiment tests the viability of the first method: manually annotated data from the NE annotation task is used to train the NER model. The second experiment tests the viability of training the NER system with gold standards built by the aligner Simalign; here, the data from the NE annotation task is used as the gold standard to evaluate the aligner's performance on the NE projection task. The performance of both the NER model and the aligner is evaluated using standard performance metrics. We found that although the NER model achieved promising overall results when trained with manually annotated data, there was still considerable room for improvement. Simalign, on the other hand, yielded very satisfactory results on the NE projection task. Due to time constraints, we did not train the NER model with the data obtained from Simalign. However, the results show that it is a very suitable aligner for NE projection in Simplified Chinese and suggest that using an aligner is a viable way to train a Chinese NER model. We are optimistic that this method can surpass the first one. The results were integrated into two core projects, MAIA (Graça et al., 2020) and the Center for Responsible AI, because of privacy issues involving NEs. These results provide essential insights for future NER development, which can have a positive impact on overall MT quality.
Main authors:	Yan, Jingxuan
Year:	2024
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Universidade de Lisboa
Language:	English
Source:	Repositório da Universidade de Lisboa
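
The NE projection and evaluation steps described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the example sentence pair, the hand-written alignment pairs (not actual Simalign output), the projection direction (English to Chinese), the helper names `project_entities` and `precision_recall_f1`, and the choice of span-level precision, recall, and F1 as the "standard performance metrics", which the abstract does not name.

```python
from typing import List, Tuple

# A span is (start token index, end token index exclusive, entity label).
Span = Tuple[int, int, str]
# An alignment pair is (source token index, target token index).
Alignment = Tuple[int, int]


def project_entities(source_spans: List[Span],
                     alignments: List[Alignment]) -> List[Span]:
    """Project entity spans from source tokens onto target tokens.

    Each source span is mapped to the smallest contiguous target span
    covering every target token aligned to a token inside the source span.
    Spans with no aligned target token are dropped.
    """
    projected = []
    for start, end, label in source_spans:
        targets = sorted(t for s, t in alignments if start <= s < end)
        if targets:
            projected.append((targets[0], targets[-1] + 1, label))
    return projected


def precision_recall_f1(predicted: List[Span],
                        gold: List[Span]) -> Tuple[float, float, float]:
    """Exact-match span-level precision, recall, and F1."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical sentence pair (illustrative only).
source_tokens = ["Maria", "works", "at", "Unbabel", "in", "Lisbon"]
target_tokens = ["玛丽亚", "在", "里斯本", "的", "Unbabel", "工作"]

# Hand-written alignment pairs standing in for a word aligner's output.
alignments = [(0, 0), (1, 5), (2, 1), (3, 4), (5, 2)]

# Entities annotated on the source side.
source_spans = [(0, 1, "PER"), (3, 4, "ORG"), (5, 6, "LOC")]

# Hypothetical human gold standard on the target side; the LOC span is
# deliberately wider than the projection to show a mismatch.
gold = [(0, 1, "PER"), (4, 5, "ORG"), (2, 4, "LOC")]

projected = project_entities(source_spans, alignments)
precision, recall, f1 = precision_recall_f1(projected, gold)
print(projected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact-match scoring is the strictest choice here: the projected LOC span counts as an error even though it overlaps the gold span, so a partial-match scheme would report higher scores on the same data.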

Related records

Named Entities Recognition for Machine Translation: A Case Study on the Importance of Named Entities for Customer Support
by: Menezes, Luís Miguel Correia
Published in: (2021)

Automatic Multilingual Recognition of Named Entities in Various Domains, for the Purposes of Machine Translation Anonymization
by: Menezes, Miguel
Published in: (2022)

Named Entity Recognition using Machine Learning techniques
by: Miranda, Nuno
Published in: (2012)

Enriching Portuguese medieval texts with named entity recognition
by: Bico, M. I.
Published in: (2024)

A Golden Resource for Named Entity Recognition in Portuguese
by: Santos, Diana
Published in: (2006)

Embeddings for Named Entity Recognition in Geoscience Portuguese Literature
by: Consoli, Bernardo
Published in: (2021)

Named entity recognition for sensitive data discovery in Portuguese
by: Dias, M.
Published in: (2020)

A Golden Resource for Named Entity Recognition in Portuguese
by: Santos D.
Published in: (2006)

Deep Test to Transformers Architecture in Named Entity Recognition
by: Antunes, Gonçalo André Santos
Published in: (2022)

NERdy: enhancing information discovery through named entity recognition
by: Magalhães, João Vilas Boas da Silva
Published in: (2024)

Bringing named entity recognition on Drupal content management system
by: Fernandes, José
Published in: (2014)

Named entity recognition for Distant Reading in several European literatures
by: Stankovic, Ranka
Published in: (2019)

HAREM: the first evaluation contest for Named Entity Recognition in Portuguese
by: Santos, Diana
Published in: (2006)

SAHARA: an online service for HAREM Named Entity Recognition Evaluation.
by: Hugo Gonçalo Oliveira
Published in: (2009)

Named Entity Recognition and Linking in a Multilingual Biomedical Setting
by: Andrade, Vítor Daniel Torres
Published in: (2021)

Applying Deep Neural Networks to Named Entity Recognition in Portuguese Texts
by: Ivo Fernandes
Published in: (2018)

A Deep Learning Approach to Named Entity Recognition in Portuguese Texts
by: Ivo André Domingues Fernandes
Published in: (2018)

Using named entity recognition for relevance detection in social network messages
by: Filipe Daniel da Gama Batista
Published in: (2017)

Named Entity Recognition Applied to Portuguese Texts from the XVIII Century
by: Zilio, Leonardo
Published in: (2022)

Enhancing Named Entity Recognition in Portuguese Literary Texts with Adaptive Models
by: O. Silva, Mariana
Published in: (2025)

Named entity recognition specialised for Portuguese 18th century History research
by: Santos, Joaquim
Published in: (2024)

Exploring Named Entity Recognition and Relation Extraction for ontology and medical records integration
by: Silva, Diego
Published in: (2023)

NLPyPort: Named Entity Recognition with CRF and Rule-Based Relation Extraction
by: Ferreira, João
Published in: (2019)

Named Entity Recognition and Data Leakage in Legislative Texts: A Literature Reassessment
by: Nunes, Rafael Oleques
Published in: (2024)

Artificial intelligence in healthcare text processing: a review applied to named entity recognition
by: Almeida, Samuel Santana de
Published in: (2025)

Contributions to Clinical Information Extraction in Portuguese: Corpora, Named Entity Recognition, Word Embeddings
by: Lopes, Fábio André da Costa
Published in: (2019)

NERP-CRF: A tool for the named entity recognition using conditional random fields
by: Amaral, Daniela Oliveira F. do
Published in: (2014)

Portuguese-Chinese neural machine translation
by: Santos, Rodrigo Soares dos
Published in: (2019)

Internships in the degree in Translation and Interpreting Portuguese-Chinese / Chinese-Portuguese of the Polytechnic of Leiria: choice of participating entities
by: Caels, Fausto
Published in: (2023)

Annotation of Named Entities in the Gaming domain
by: Silva, Rita
Published in: (2022)

An investigation of Entity-Aware Neural Machine Translation for Biomedical Texts
by: Analu Rufino Ramos
Published in: (2025)

Using machine learning algorithms to identify named entities in legal documents: a preliminary approach
by: Poudyal, Prakash
Published in: (2012)

Plano de tese: How to Keep up with Language Dynamics? - A case study on Named Entity Recognition
by: Mota, Cristina
Published in: (2005)

Named entity error prediction across domains
by: Andrade, Juliana Valpasso de
Published in: (2024)

An Approach to Web-Scale Named-Entity Disambiguation
by: Luís Sarmento
Published in: (2009)

Analysis of the chinese – portuguese machine translation of chinese localizers qian and hou
by: Lu, Chunhui
Published in: (2015)

ON THE 'TRANSLATION' OR NON "TRANSLATION' OF PROPER NAMES
by: Lopes, Dalila
Published in: (2019)

Recognizing and linking named entities in Portuguese medieval texts
by: Bico, M. I.
Published in: (2024)

Perianal Paget Disease: Different Entities With the Same Name
by: Santos, Marisa D.
Published in: (2021)

Named entity extraction from Portuguese web text
by: André Ricardo Oliveira Pires
Published in: (2017)