Publication
BERT Mapper: An Entity Linking Method for Patent Text
| Abstract: | To assess the evolution of technological trends over past decades, the data science team at the European Patent Office set out to develop an interactive dashboard that tracks mentions of technologies in patent texts. To improve information quality and avoid cluttering the dashboard visualizations with “noisy” and synonymous keywords, an entity linking system was devised. The system described in this project therefore falls within the subfield of Entity Linking, under Natural Language Processing. Its goal was to extract the most important technology-related keywords from patent abstracts and titles and to assign each of them to an entity in the Wikipedia knowledge base. This way, only the matched entity, and not the extracted keyword, would be shown in the final dashboard. The system distinguishes itself from other state-of-the-art methods by generating contextually meaningful entity vectors with BERT: it extracts and averages only the token vectors corresponding to the entity’s surface form, across the entire knowledge base. It is also the first time that such a system has been applied to patent information, whose linguistic characteristics differ from those of other fields. Its main objectives were noise reduction and mapping improvements, particularly in disambiguation, to overcome the weaknesses of the system in production. Given the specificity of the downstream task, the vectors computed with this methodology outperformed those calculated using SBERT. This simple yet effective vector generation methodology is the backbone of the full entity linking system proposed in this work, which outperformed the baseline evaluation scenarios, such as the system currently in production and DBpedia Spotlight, more than doubling its mapping precision. |
|---|---|
| Main authors: | Pais, Nuno David Ribeiro |
| Subject: | Natural Language Processing; Entity Linking; Deep Learning; Milvus; BERT |
| Year: | 2023 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | restricted access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Source: | Repositório Institucional da UNL |
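
The vector generation step summarized in the abstract (averaging only the token vectors that correspond to an entity's surface form, over every knowledge-base sentence that mentions it) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the names `find_span`, `entity_vector`, and `toy_encode` are hypothetical, exact token matching is assumed for locating the surface form, and `toy_encode` is a stand-in for a real BERT forward pass that would return one contextual vector per token.

```python
import numpy as np


def find_span(tokens, surface_tokens):
    """Return (start, end) of the first exact match of surface_tokens, or None."""
    n = len(surface_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == surface_tokens:
            return i, i + n
    return None


def entity_vector(sentences, surface_tokens, encode):
    """Average the surface-form token vectors over all sentences that mention it.

    sentences: list of token lists (assumed to come from the knowledge base).
    encode: callable mapping a token list to an array of shape (len(tokens), dim);
            in practice this would wrap a BERT model (assumption).
    """
    collected = []
    for tokens in sentences:
        span = find_span(tokens, surface_tokens)
        if span is None:
            continue  # sentence does not mention the entity
        vectors = encode(tokens)
        # Keep only the rows belonging to the surface form, not the whole sentence.
        collected.append(vectors[span[0]:span[1]])
    if not collected:
        return None
    return np.concatenate(collected, axis=0).mean(axis=0)


def toy_encode(tokens, dim=4):
    """Hypothetical stand-in encoder: vectors depend on the token and sentence length."""
    rows = []
    for tok in tokens:
        base = sum(ord(c) for c in tok) % 10
        rows.append(np.full(dim, base + len(tokens), dtype=float))
    return np.stack(rows)


if __name__ == "__main__":
    kb_sentences = [
        ["a", "neural", "network", "learns", "features"],
        ["the", "neural", "network", "is", "deep"],
        ["patents", "describe", "inventions"],
    ]
    print(entity_vector(kb_sentences, ["neural", "network"], toy_encode))
```

In a full system along the lines the abstract describes, vectors of this kind would be stored in a vector database (the record's keywords mention Milvus) and compared against the vector of a keyword extracted from a patent text; how that comparison is done is not stated in this record.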