Publication
Improving semantic similarity for proteins based on the gene ontology
| Abstract: | One of the current challenges in the Life Sciences is to extract the knowledge contained in the vast amount of data that the genomic and post-genomic techniques are producing. One of the major efforts in this area was the development of the Gene Ontology (GO), a BioOntology that contains terms that describe gene products, organized in a graph structure. Gene products annotated with ontology terms can be compared according to them. This process is called semantic similarity and it is based on the structure of the BioOntology and the relations between its terms, focusing either on a structural comparison or, more frequently, on the semantic similarity between the terms themselves. In this work, I developed two novel hybrid measures of semantic similarity for proteins based on the Gene Ontology: simGIC (Graph-Information Content similarity) and simGED (Graph-Edit-Distance similarity). These measures were designed to take into account both graph attributes and the terms' information content, thus capturing more information than the previously existing measures, which focused mostly on a single aspect (graph structure or term similarity). These two novel measures were evaluated against several previously proposed measures, using two strategies: relationship with sequence similarity and correlation with family similarity. The evaluation metric in the sequence similarity studies was the resolution of the measures, i.e. the range of semantic similarity values they cover, since most measures showed the same behaviour and similar correlation values. Overall, simGIC was shown to be the best performer, with both the highest resolutions in the sequence similarity evaluation and the highest correlation to family similarity, while simGED obtained above-average results. The influence of electronic annotations was also investigated, but I found no conclusive evidence to support the general view that these are unreliable to use in semantic similarity studies. Keywords: Semantic Similarity, BioOntologies, Gene Ontology, Genome Annotation. |
|---|---|
| Main authors: | Pesquita, Cátia |
| Subject: | Semantic similarity; BioOntologies; Gene ontology; Genome annotation; Master's theses - 2007 |
| Year: | 2007 |
| Country: | Portugal |
| Document type: | master's thesis |
| Access type: | restricted access |
| Associated institution: | Universidade de Lisboa |
| Language: | English |
| Source: | Repositório da Universidade de Lisboa |
| _version_ | 1866811366511214592 |
|---|---|
| author | Pesquita, Cátia |
| author_facet | Pesquita, Cátia |
| author_role | author |
| contributor_name_str_mv | Couto, Francisco José Moreira; Repositório Científico de Acesso Aberto da ULisboa |
| country_str | PT |
| creators_json_txt | [{\"Person.name\":\"Pesquita, Cátia\"}] |
| datacite.contributors.contributor.contributorName.fl_str_mv | Couto, Francisco José Moreira; Repositório Científico de Acesso Aberto da ULisboa |
| datacite.creators.creator.creatorName.fl_str_mv | Pesquita, Cátia |
| datacite.date.Accepted.fl_str_mv | 2007-01-01T00:00:00Z |
| datacite.date.available.fl_str_mv | 2009-02-10T13:12:42Z |
| datacite.date.embargoed.fl_str_mv | 2009-02-10T13:12:42Z |
| datacite.rights.fl_str_mv | http://purl.org/coar/access_right/c_16ec |
| datacite.subjects.subject.fl_str_mv | Semantic similarity; BioOntologies; Gene ontology; Genome annotation; Teses de mestrado - 2007 |
| datacite.titles.title.fl_str_mv | Improving semantic similarity for proteins based on the gene ontology |
| dc.contributor.none.fl_str_mv | Couto, Francisco José Moreira; Repositório Científico de Acesso Aberto da ULisboa |
| dc.creator.none.fl_str_mv | Pesquita, Cátia |
| dc.date.Accepted.fl_str_mv | 2007-01-01T00:00:00Z |
| dc.date.available.fl_str_mv | 2009-02-10T13:12:42Z |
| dc.date.embargoed.fl_str_mv | 2009-02-10T13:12:42Z |
| dc.format.none.fl_str_mv | application/pdf |
| dc.identifier.none.fl_str_mv | http://hdl.handle.net/10451/14056 |
| dc.language.none.fl_str_mv | eng |
| dc.rights.none.fl_str_mv | http://purl.org/coar/access_right/c_16ec |
| dc.subject.none.fl_str_mv | Semantic similarity; BioOntologies; Gene ontology; Genome annotation; Teses de mestrado - 2007 |
| dc.title.fl_str_mv | Improving semantic similarity for proteins based on the gene ontology |
| dc.type.none.fl_str_mv | http://purl.org/coar/resource_type/c_bdcc |
| description | One of the current challenges in the Life Sciences is to extract the knowledge contained in the vast amount of data that the genomic and post-genomic techniques are producing. One of the major efforts in this area was the development of the Gene Ontology (GO), a BioOntology that contains terms that describe gene products, organized in a graph structure. Gene products annotated with ontology terms can be compared according to them. This process is called semantic similarity and it is based on the structure of the BioOntology and the relations between its terms, focusing either on a structural comparison or, more frequently, on the semantic similarity between the terms themselves. In this work, I developed two novel hybrid measures of semantic similarity for proteins based on the Gene Ontology: simGIC (Graph-Information Content similarity) and simGED (Graph-Edit-Distance similarity). These measures were designed to take into account both graph attributes and the terms' information content, thus capturing more information than the previously existing measures, which focused mostly on a single aspect (graph structure or term similarity). These two novel measures were evaluated against several previously proposed measures, using two strategies: relationship with sequence similarity and correlation with family similarity. The evaluation metric in the sequence similarity studies was the resolution of the measures, i.e. the range of semantic similarity values they cover, since most measures showed the same behaviour and similar correlation values. Overall, simGIC was shown to be the best performer, with both the highest resolutions in the sequence similarity evaluation and the highest correlation to family similarity, while simGED obtained above-average results. The influence of electronic annotations was also investigated, but I found no conclusive evidence to support the general view that these are unreliable to use in semantic similarity studies. Keywords: Semantic Similarity, BioOntologies, Gene Ontology, Genome Annotation. |
| dirty | 0 |
| eu_rights_str_mv | restrictedAccess |
| format | masterThesis |
| fulltext.url.fl_str_mv | https://repositorio.ulisboa.pt/bitstreams/9ccb098e-3d70-45a7-b8cf-c6cede40175b/download |
| id | ul_738d53c8fca96a81f2a36c5116a7ffe0 |
| identifier.url.fl_str_mv | http://hdl.handle.net/10451/14056 |
| instacron_str | ul |
| institution | Universidade de Lisboa |
| instname_str | Universidade de Lisboa |
| language | eng |
| network_acronym_str | ul |
| network_name_str | Repositório da Universidade de Lisboa |
| oai_identifier_str | oai:repositorio.ulisboa.pt:10451/14056 |
| organization_str_mv | urn:organizationAcronym:ul |
| person_str_mv | Pesquita, Cátia |
| publishDate | 2007 |
| reponame_str | Repositório da Universidade de Lisboa |
| repository_id_str | urn:repositoryAcronym:ul |
| service_str_mv | urn:repositoryAcronym:ul |
| spelling | eng; por; One of the current challenges in the Life Sciences is to extract the knowledge contained in the vast amount of data that the genomic and post-genomic techniques are producing. One of the major efforts in this area was the development of the Gene Ontology (GO), a BioOntology that contains terms that describe gene products, organized in a graph structure. Gene products annotated with ontology terms can be compared according to them. This process is called semantic similarity and it is based on the structure of the BioOntology and the relations between its terms, focusing either on a structural comparison or more frequently on the semantic similarity between the terms themselves. In this work, I developed two novel hybrid measures of semantic similarity for proteins based on the Gene Ontology: simGIC (Graph-Information Content similarity) and simGED (Graph-Edit-Distance similarity). These measures were designed to take into account both graph attributes and the terms' information content, thus capturing more information than the previously existing measures which focused mostly on a single aspect (graph structure or term similarity). These two novel measures were evaluated against several previously proposed measures, using two strategies: relationship with sequence similarity and correlation with family similarity. The evaluation metric in the sequence similarity studies was the resolution of the measures, i.e. the range of semantic similarity values they cover, since most measures showed the same behaviour and similar correlation values. Overall simGIC was shown to be the best performer, with both the highest resolutions in the sequence similarity evaluation and highest correlation to family similarity, while simGED obtained above average results. The influence of electronic annotations was also investigated but I found no conclusive evidence to support the general view that these are unreliable to use in semantic similarity studies. Keywords: Semantic Similarity, BioOntologies, Gene Ontology, Genome Annotation.; application/pdf; por; Improving semantic similarity for proteins based on the gene ontology; Pesquita, Cátia; Couto, Francisco José Moreira; HostingInstitution; Organizational; Repositório Científico de Acesso Aberto da ULisboa; e-mail; mailto:repositorio@reitoria.ulisboa.pt; repositorio@reitoria.ulisboa.pt; URL; http://repositorio.ul.pt/handle/10455/3075; 2009-02-10T13:12:42Z; 2007; 2007-01-01T00:00:00Z; Handle; http://hdl.handle.net/10451/14056; http://purl.org/coar/access_right/c_16ec; restricted access; Semantic similarity; BioOntologies; Gene ontology; Genome annotation; Teses de mestrado - 2007; 1407812 bytes; literature; http://purl.org/coar/resource_type/c_bdcc; master thesis; http://purl.org/coar/access_right/c_16ec; application/pdf; fulltext; https://repositorio.ulisboa.pt/bitstreams/9ccb098e-3d70-45a7-b8cf-c6cede40175b/download |
| spellingShingle | Improving semantic similarity for proteins based on the gene ontology; Pesquita, Cátia; Semantic similarity; BioOntologies; Gene ontology; Genome annotation; Teses de mestrado - 2007 |
| status | SINGLETON |
| subject.fl_str_mv | Semantic similarity; BioOntologies; Gene ontology; Genome annotation; Teses de mestrado - 2007 |
| title | Improving semantic similarity for proteins based on the gene ontology |
| title_full | Improving semantic similarity for proteins based on the gene ontology |
| title_fullStr | Improving semantic similarity for proteins based on the gene ontology |
| title_full_unstemmed | Improving semantic similarity for proteins based on the gene ontology |
| title_short | Improving semantic similarity for proteins based on the gene ontology |
| title_sort | Improving semantic similarity for proteins based on the gene ontology |
| topic | Semantic similarity; BioOntologies; Gene ontology; Genome annotation; Teses de mestrado - 2007 |
| topic_facet | Semantic similarity; BioOntologies; Gene ontology; Genome annotation; Teses de mestrado - 2007 |
| url | http://hdl.handle.net/10451/14056 |
| visible | 1 |