Publication

Improving semantic similarity for proteins based on the gene ontology

Bibliographic details
Abstract: One of the current challenges in the Life Sciences is to extract the knowledge contained in the vast amount of data that genomic and post-genomic techniques are producing. One of the major efforts in this area was the development of the Gene Ontology (GO), a BioOntology that contains terms that describe gene products, organized in a graph structure. Gene products annotated with ontology terms can be compared according to those terms. This process is called semantic similarity, and it is based on the structure of the BioOntology and the relations between its terms, focusing either on a structural comparison or, more frequently, on the semantic similarity between the terms themselves. In this work, I developed two novel hybrid measures of semantic similarity for proteins based on the Gene Ontology: simGIC (Graph-Information Content similarity) and simGED (Graph-Edit-Distance similarity). These measures were designed to take into account both graph attributes and the terms' information content, thus capturing more information than the previously existing measures, which focused mostly on a single aspect (graph structure or term similarity). These two novel measures were evaluated against several previously proposed measures, using two strategies: relationship with sequence similarity and correlation with family similarity. The evaluation metric in the sequence similarity studies was the resolution of the measures, i.e. the range of semantic similarity values they cover, since most measures showed the same behaviour and similar correlation values. Overall, simGIC was shown to be the best performer, with both the highest resolution in the sequence similarity evaluation and the highest correlation to family similarity, while simGED obtained above-average results. The influence of electronic annotations was also investigated, but I found no conclusive evidence to support the general view that these are unreliable to use in semantic similarity studies.
Keywords: Semantic Similarity, BioOntologies, Gene Ontology, Genome Annotation.
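The abstract names simGIC but this record does not give its formula. As a minimal sketch, assuming the commonly cited definition of simGIC as an information-content-weighted Jaccard index over each protein's annotation terms extended with their ancestors, the idea can be illustrated as follows. The toy ontology, the protein names, and the helper functions are hypothetical and are not taken from the dissertation.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (not real GO identifiers).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "kinase": {"catalysis"},
}

# Hypothetical annotation corpus: protein -> directly annotated terms.
ANNOTATIONS = {
    "P1": {"dna_binding"},
    "P2": {"dna_binding", "kinase"},
    "P3": {"kinase"},
    "P4": {"binding"},
}


def with_ancestors(terms):
    """Return the given terms plus all of their ancestors in the toy graph."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def information_content():
    """IC(t) = -log p(t), where p(t) is the fraction of proteins annotated
    with t directly or through one of its descendants."""
    extended = [with_ancestors(terms) for terms in ANNOTATIONS.values()]
    total = len(extended)
    ic = {}
    for term in PARENTS:
        count = sum(term in ext for ext in extended)
        if count:  # terms never used in the corpus get no IC value
            ic[term] = -math.log(count / total)
    return ic


def sim_gic(terms_a, terms_b, ic):
    """Assumed simGIC: summed IC of shared terms over summed IC of all terms."""
    a, b = with_ancestors(terms_a), with_ancestors(terms_b)
    union = sum(ic[t] for t in a | b)
    if union == 0:
        return 0.0
    return sum(ic[t] for t in a & b) / union


ic = information_content()
score_12 = sim_gic(ANNOTATIONS["P1"], ANNOTATIONS["P2"], ic)
score_13 = sim_gic(ANNOTATIONS["P1"], ANNOTATIONS["P3"], ic)
```

In this toy corpus, P1 and P3 share only the root term, which carries no information content, so their score is zero, while P1 and P2 share an informative branch and score strictly between 0 and 1. Weighting shared terms by information content is what makes the measure hybrid in the abstract's sense: the ancestor sets come from the graph structure and the weights come from annotation frequency.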
Main authors: Pesquita, Cátia
Subject: Semantic similarity; BioOntologies; Gene ontology; Genome annotation; Teses de mestrado - 2007 (master's theses, 2007)
Year: 2007
Country: Portugal
Document type: master's dissertation
Access type: restricted access
Associated institution: Universidade de Lisboa
Language: English
Source: Repositório da Universidade de Lisboa
_version_ 1866811366511214592
author Pesquita, Cátia
author_facet Pesquita, Cátia
author_role author
contributor_name_str_mv Couto, Francisco José Moreira
Repositório Científico de Acesso Aberto da ULisboa
country_str PT
creators_json_txt [{"Person.name":"Pesquita, Cátia"}]
datacite.contributors.contributor.contributorName.fl_str_mv Couto, Francisco José Moreira
Repositório Científico de Acesso Aberto da ULisboa
datacite.creators.creator.creatorName.fl_str_mv Pesquita, Cátia
datacite.date.Accepted.fl_str_mv 2007-01-01T00:00:00Z
datacite.date.available.fl_str_mv 2009-02-10T13:12:42Z
datacite.date.embargoed.fl_str_mv 2009-02-10T13:12:42Z
datacite.rights.fl_str_mv http://purl.org/coar/access_right/c_16ec
datacite.subjects.subject.fl_str_mv Semantic similarity
BioOntologies
Gene ontology
Genome annotation
Teses de mestrado - 2007
datacite.titles.title.fl_str_mv Improving semantic similarity for proteins based on the gene ontology
dc.contributor.none.fl_str_mv Couto, Francisco José Moreira
Repositório Científico de Acesso Aberto da ULisboa
dc.creator.none.fl_str_mv Pesquita, Cátia
dc.date.Accepted.fl_str_mv 2007-01-01T00:00:00Z
dc.date.available.fl_str_mv 2009-02-10T13:12:42Z
dc.date.embargoed.fl_str_mv 2009-02-10T13:12:42Z
dc.format.none.fl_str_mv application/pdf
dc.identifier.none.fl_str_mv http://hdl.handle.net/10451/14056
dc.language.none.fl_str_mv eng
dc.rights.none.fl_str_mv http://purl.org/coar/access_right/c_16ec
dc.subject.none.fl_str_mv Semantic similarity
BioOntologies
Gene ontology
Genome annotation
Teses de mestrado - 2007
dc.title.fl_str_mv Improving semantic similarity for proteins based on the gene ontology
dc.type.none.fl_str_mv http://purl.org/coar/resource_type/c_bdcc
description One of the current challenges in the Life Sciences is to extract the knowledge contained in the vast amount of data that the genomic and post-genomic techniques are producing. One of the major efforts in this area was the development of the Gene Ontology (GO), a BioOntology that contains terms that describe gene products, organized in a graph structure. Gene products annotated with ontology terms can be compared according to them. This process is called semantic similarity and it is based on the structure of the BioOntology and the relations between its terms, focusing either on a structural comparison or more frequently on the semantic similarity between the terms themselves. In this work, I developed two novel hybrid measures of semantic similarity for proteins based on the Gene Ontology: simGIC (Graph-Information Content similarity) and simGED (Graph-Edit-Distance similarity). These measures were designed to take into account both graph attributes and the terms' information content, thus capturing more information than the previously existing measures which focused mostly on a single aspect (graph structure or term similarity). These two novel measures were evaluated against several previously proposed measures, using two strategies: relationship with sequence similarity and correlation with family similarity. The evaluation metric in the sequence similarity studies was the resolution of the measures, i.e. the range of semantic similarity values they cover, since most measures showed the same behaviour and similar correlation values. Overall simGIC was shown to be the best performer, with both the highest resolutions in the sequence similarity evaluation and highest correlation to family similarity, while simGED obtained above average results. The influence of electronic annotations was also investigated but I found no conclusive evidence to support the general view that these are unreliable to use in semantic similarity studies.
Keywords: Semantic Similarity, BioOntologies, Gene Ontology, Genome Annotation.
dirty 0
eu_rights_str_mv restrictedAccess
format masterThesis
fulltext.url.fl_str_mv https://repositorio.ulisboa.pt/bitstreams/9ccb098e-3d70-45a7-b8cf-c6cede40175b/download
id ul_738d53c8fca96a81f2a36c5116a7ffe0
identifier.url.fl_str_mv http://hdl.handle.net/10451/14056
instacron_str ul
institution Universidade de Lisboa
instname_str Universidade de Lisboa
language eng
network_acronym_str ul
network_name_str Repositório da Universidade de Lisboa
oai_identifier_str oai:repositorio.ulisboa.pt:10451/14056
organization_str_mv urn:organizationAcronym:ul
person_str_mv Pesquita, Cátia
publishDate 2007
reponame_str Repositório da Universidade de Lisboa
repository_id_str urn:repositoryAcronym:ul
service_str_mv urn:repositoryAcronym:ul
spelling eng
por
application/pdf
por
Improving semantic similarity for proteins based on the gene ontology
Pesquita, Cátia
Couto, Francisco José Moreira
HostingInstitution
Organizational
Repositório Científico de Acesso Aberto da ULisboa
e-mail
mailto:repositorio@reitoria.ulisboa.pt
repositorio@reitoria.ulisboa.pt
URL
http://repositorio.ul.pt/handle/10455/3075
2009-02-10T13:12:42Z
2007
2007-01-01T00:00:00Z
Handle
http://hdl.handle.net/10451/14056
http://purl.org/coar/access_right/c_16ec
restricted access
Semantic similarity
BioOntologies
Gene ontology
Genome annotation
Teses de mestrado - 2007
1407812 bytes
literature
http://purl.org/coar/resource_type/c_bdcc
master thesis
http://purl.org/coar/access_right/c_16ec
application/pdf
fulltext
https://repositorio.ulisboa.pt/bitstreams/9ccb098e-3d70-45a7-b8cf-c6cede40175b/download
spellingShingle Improving semantic similarity for proteins based on the gene ontology
Pesquita, Cátia
Semantic similarity
BioOntologies
Gene ontology
Genome annotation
Teses de mestrado - 2007
status SINGLETON
subject.fl_str_mv Semantic similarity
BioOntologies
Gene ontology
Genome annotation
Teses de mestrado - 2007
title Improving semantic similarity for proteins based on the gene ontology
title_full Improving semantic similarity for proteins based on the gene ontology
title_fullStr Improving semantic similarity for proteins based on the gene ontology
title_full_unstemmed Improving semantic similarity for proteins based on the gene ontology
title_short Improving semantic similarity for proteins based on the gene ontology
title_sort Improving semantic similarity for proteins based on the gene ontology
topic Semantic similarity
BioOntologies
Gene ontology
Genome annotation
Teses de mestrado - 2007
topic_facet Semantic similarity
BioOntologies
Gene ontology
Genome annotation
Teses de mestrado - 2007
url http://hdl.handle.net/10451/14056
visible 1