Publication

BERT Mapper: An Entity Linking Method for Patent Text


Bibliographic details
Abstract: To assess the evolution of technological trends over past decades, the data science team at the European Patent Office set out to develop an interactive dashboard tracking mentions of technologies in patent texts. To improve information quality and avoid cluttering the dashboard visualizations with “noisy” and synonymous keywords, an entity linking system was devised. The system described in this project therefore falls within the sub-field of Entity Linking, under Natural Language Processing. Its goal was to extract the most important technology-related keywords from patent abstracts and titles and assign each to an entity in the Wikipedia knowledge base. This way, only the matched entity, not the extracted keyword, would be shown in the final dashboard. The system distinguishes itself from other state-of-the-art methods by generating contextually meaningful entity vectors with BERT: it extracts only the token vectors corresponding to the entity’s surface form and averages them across the entire knowledge base. It is also the first time that such a system has been applied to patent information, whose linguistic characteristics differ from those of other fields. Its main objectives were noise reduction and mapping improvements, particularly in disambiguation, to overcome the weaknesses of the system in production. Given the specificity of the downstream task, the vectors computed with this methodology outperformed those calculated using SBERT. This simple yet effective vector generation methodology is the backbone of the full entity linking system proposed in this work, which outperformed the baseline evaluation scenarios, such as the system currently in production and DBpedia Spotlight, more than doubling mapping precision.
Main authors: Pais, Nuno David Ribeiro
Subject: Natural Language Processing; Entity Linking; Deep Learning; Milvus; BERT
Year: 2023
Country: Portugal
Document type: master's dissertation
Access type: restricted access
Associated institution: Universidade Nova de Lisboa
Language: English
Source: Repositório Institucional da UNL
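
The vector generation step summarized in the abstract (average the token vectors of an entity's surface form, then average again over every mention in the knowledge base) can be illustrated with a minimal sketch. This is not the dissertation's code. The function names are hypothetical, the hand-written 2-dimensional vectors stand in for BERT token embeddings, and the use of cosine similarity for matching is an assumption, since the abstract does not name the similarity measure.

```python
from math import sqrt


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mention_vector(token_vectors, start, end):
    """Average the token vectors covering one surface-form mention.

    token_vectors holds one contextual vector per token of a sentence
    (in a real system these would come from BERT; here they are toy values).
    The mention occupies the token span [start, end).
    """
    return mean_vectors(token_vectors[start:end])


def entity_vector(mentions):
    """Average the mention vectors over all occurrences of one entity."""
    return mean_vectors([mention_vector(tv, s, e) for tv, s, e in mentions])


def cosine(a, b):
    """Cosine similarity (assumed here; not stated in the abstract)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def link(keyword_vec, entity_vecs):
    """Return the name of the entity whose vector is closest to the keyword."""
    return max(entity_vecs, key=lambda name: cosine(keyword_vec, entity_vecs[name]))


# Two toy sentences mentioning the same entity.
# Each tuple: (token vectors of the sentence, span start, span end).
mentions = [
    ([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]], 1, 3),  # mention spans tokens 1-2
    ([[3.0, 1.0], [9.0, 9.0]], 0, 1),              # mention is token 0
]

entity_vecs = {
    "Entity A": entity_vector(mentions),
    "Entity B": [-1.0, 3.0],  # a second, unrelated toy entity
}

# A keyword vector extracted from a patent text (toy value).
best = link([1.9, 1.2], entity_vecs)
print(best)
```

In a full system the brute-force `max` over all entities would normally be replaced by an approximate nearest-neighbour search in a vector database; the record's subject keywords list Milvus, which suggests that role, though the abstract does not describe how it was used.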