Publication

Identifying chemical entities on literature: a machine learning approach using dictionaries as domain knowledge

Bibliographic details
Abstract: The volume of life science publications, and therefore of the underlying biomedical knowledge, is growing at a fast pace. However, manual literature analysis is a slow and painful task. Hence, text mining systems have been developed to automatically locate the relevant information contained in the literature. An essential step in text mining is named entity recognition, but the inherent complexity of biomedical entities, such as chemical compounds, makes it difficult to obtain good performance in this task. This thesis proposes methods capable of improving the current performance of chemical entity recognition from text. A case-based method for recognizing chemical entities is proposed, and the evaluation results obtained outperform the most widely used methods, which are based on dictionaries. A lexical-similarity-based chemical entity resolution method was also developed, which allows efficient mapping of the recognized entities to the ChEBI database. To improve the chemical entity identification results, we developed a validation method that exploits the semantic relationships in ChEBI to measure the similarity between the entities found in the text, in order to discriminate between correctly identified entities that can be validated and identification errors that should be discarded. A machine learning method for detecting entity recognition errors is also proposed, which can effectively find recognition errors in rule-based systems. The methods were integrated in a system capable of recognizing chemical entities in texts, mapping them to the ChEBI database, and providing evidence of validation or recognition error for the recognized entities.
Main authors: Grego, Tiago Daniel Pereira, 1983-
Subject: Computerized document retrieval; Text; Chemical compounds; Semantics; Bioinformatics; Doctoral theses - 2013
Year: 2013
Country: Portugal
Document type: doctoral thesis
Access type: open access
Associated institution: Universidade de Lisboa
Language: English
Source: Repositório da Universidade de Lisboa