Publication

Improving semantic similarity for proteins based on the gene ontology

Bibliographic details
Abstract:	One of the current challenges in the Life Sciences is to extract the knowledge contained in the vast amount of data that the genomic and post-genomic techniques are producing. One of the major efforts in this area was the development of the Gene Ontology (GO), a BioOntology that contains terms that describe gene products, organized in a graph structure. Gene products annotated with ontology terms can be compared according to them. This process is called semantic similarity and it is based on the structure of the BioOntology and the relations between its terms, focusing either on a structural comparison or, more frequently, on the semantic similarity between the terms themselves. In this work, I developed two novel hybrid measures of semantic similarity for proteins based on the Gene Ontology: simGIC (Graph-Information Content similarity) and simGED (Graph-Edit-Distance similarity). These measures were designed to take into account both graph attributes and the terms' information content, thus capturing more information than the previously existing measures, which focused mostly on a single aspect (graph structure or term similarity). These two novel measures were evaluated against several previously proposed measures, using two strategies: relationship with sequence similarity and correlation with family similarity. The evaluation metric in the sequence similarity studies was the resolution of the measures, i.e. the range of semantic similarity values they cover, since most measures showed the same behaviour and similar correlation values. Overall, simGIC was shown to be the best performer, with both the highest resolutions in the sequence similarity evaluation and the highest correlation to family similarity, while simGED obtained above-average results. The influence of electronic annotations was also investigated, but I found no conclusive evidence to support the general view that these are unreliable to use in semantic similarity studies.
Keywords: Semantic Similarity, BioOntologies, Gene Ontology, Genome Annotation.
Main authors:	Pesquita, Cátia
Subject:	Semantic similarity; BioOntologies; Gene ontology; Genome annotation; Master's theses - 2007
Year:	2007
Country:	Portugal
Document type:	master's dissertation
Access type:	restricted access
Associated institution:	Universidade de Lisboa
Language:	English
Source:	Repositório da Universidade de Lisboa

_version_	1866811366511214592
author	Pesquita, Cátia
author_facet	Pesquita, Cátia
author_role	author
contributor_name_str_mv	Couto, Francisco José Moreira Repositório Científico de Acesso Aberto da ULisboa
country_str	PT
creators_json_txt	[{\"Person.name\":\"Pesquita, Cátia\"}]
datacite.contributors.contributor.contributorName.fl_str_mv	Couto, Francisco José Moreira Repositório Científico de Acesso Aberto da ULisboa
datacite.creators.creator.creatorName.fl_str_mv	Pesquita, Cátia
datacite.date.Accepted.fl_str_mv	2007-01-01T00:00:00Z
datacite.date.available.fl_str_mv	2009-02-10T13:12:42Z
datacite.date.embargoed.fl_str_mv	2009-02-10T13:12:42Z
datacite.rights.fl_str_mv	http://purl.org/coar/access_right/c_16ec
datacite.subjects.subject.fl_str_mv	Semantic similarity BioOntologies Gene ontology Genome annotation Teses de mestrado - 2007
datacite.titles.title.fl_str_mv	Improving semantic similarity for proteins based on the gene ontology
dc.contributor.none.fl_str_mv	Couto, Francisco José Moreira Repositório Científico de Acesso Aberto da ULisboa
dc.creator.none.fl_str_mv	Pesquita, Cátia
dc.date.Accepted.fl_str_mv	2007-01-01T00:00:00Z
dc.date.available.fl_str_mv	2009-02-10T13:12:42Z
dc.date.embargoed.fl_str_mv	2009-02-10T13:12:42Z
dc.format.none.fl_str_mv	application/pdf
dc.identifier.none.fl_str_mv	http://hdl.handle.net/10451/14056
dc.language.none.fl_str_mv	eng
dc.rights.none.fl_str_mv	http://purl.org/coar/access_right/c_16ec
dc.subject.none.fl_str_mv	Semantic similarity BioOntologies Gene ontology Genome annotation Teses de mestrado - 2007
dc.title.fl_str_mv	Improving semantic similarity for proteins based on the gene ontology
dc.type.none.fl_str_mv	http://purl.org/coar/resource_type/c_bdcc
dirty	0
eu_rights_str_mv	restrictedAccess
format	masterThesis
fulltext.url.fl_str_mv	https://repositorio.ulisboa.pt/bitstreams/9ccb098e-3d70-45a7-b8cf-c6cede40175b/download
id	ul_738d53c8fca96a81f2a36c5116a7ffe0
identifier.url.fl_str_mv	http://hdl.handle.net/10451/14056
instacron_str	ul
institution	Universidade de Lisboa
instname_str	Universidade de Lisboa
language	eng
network_acronym_str	ul
network_name_str	Repositório da Universidade de Lisboa
oai_identifier_str	oai:repositorio.ulisboa.pt:10451/14056
organization_str_mv	urn:organizationAcronym:ul
person_str_mv	Pesquita, Cátia
publishDate	2007
reponame_str	Repositório da Universidade de Lisboa
repository_id_str	urn:repositoryAcronym:ul
service_str_mv	urn:repositoryAcronym:ul
spellingShingle	Improving semantic similarity for proteins based on the gene ontology Pesquita, Cátia Semantic similarity BioOntologies Gene ontology Genome annotation Teses de mestrado - 2007
status	SINGLETON
subject.fl_str_mv	Semantic similarity BioOntologies Gene ontology Genome annotation Teses de mestrado - 2007
title	Improving semantic similarity for proteins based on the gene ontology
title_full	Improving semantic similarity for proteins based on the gene ontology
title_fullStr	Improving semantic similarity for proteins based on the gene ontology
title_full_unstemmed	Improving semantic similarity for proteins based on the gene ontology
title_short	Improving semantic similarity for proteins based on the gene ontology
title_sort	Improving semantic similarity for proteins based on the gene ontology
topic	Semantic similarity BioOntologies Gene ontology Genome annotation Teses de mestrado - 2007
topic_facet	Semantic similarity BioOntologies Gene ontology Genome annotation Teses de mestrado - 2007
url	http://hdl.handle.net/10451/14056
visible	1
