Publication

Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora

Bibliographic details
Abstract: The amount of information available through the Internet has grown significantly in the last decade. This information comes from various sources, such as scientific experiments in particle acceleration, flight data recorded from commercial aircraft, or sets of documents from a given domain, such as medical articles, newspaper headlines, or social network content. Due to the volume of data that must be analyzed, search engines need new tools that allow the user to obtain the desired information in a timely and accurate manner. One approach is to annotate documents with their relevant expressions. Relevant expressions can be extracted from natural language text documents using semantic, syntactic, or statistical techniques. Although statistical techniques tend to be less accurate, they have the advantage of being language-independent. This investigation was performed in the context of LocalMaxs, a statistical and thus language-independent method capable of extracting relevant expressions from natural language corpora. However, due to the large volume of data involved, sequential implementations of these techniques are severely limited in both execution time and memory space. In this thesis we propose a distributed architecture and strategies for parallel implementations of statistical-based extraction of relevant expressions from large corpora. A methodology was developed for modeling and evaluating those strategies, based on empirical and theoretical approaches to estimating the statistical distribution of n-grams in natural language corpora. These approaches were applied to guide the design and the evaluation of the behavior of parallel and distributed LocalMaxs implementations on cluster and cloud computing platforms. The implementation alternatives were compared with respect to precision and recall and to their performance metrics, namely execution time, parallel speedup, and sizeup. The performance results indicate almost linear speedup and sizeup for the range of large corpus sizes.
Main authors: Gonçalves, Carlos Jorge de Sousa
Subject: Parallel and Distributed Computing; Extraction of Relevant Expressions; Statistical n-gram Methods; Caching Strategies
Year: 2017
Country: Portugal
Document type: doctoral thesis
Access type: open access
Associated institution: Universidade Nova de Lisboa
Language: English
Source: Repositório Institucional da UNL
_version_ 1868414995671285760
author Gonçalves, Carlos Jorge de Sousa
author_facet Gonçalves, Carlos Jorge de Sousa
author_role author
contributor_name_str_mv Cunha, José
Silva, Joaquim
RUN
country_str PT
creators_json_txt [{"Person.name":"Gonçalves, Carlos Jorge de Sousa"}]
datacite.contributors.contributor.contributorName.fl_str_mv Cunha, José
Silva, Joaquim
RUN
datacite.creators.creator.creatorName.fl_str_mv Gonçalves, Carlos Jorge de Sousa
datacite.date.Accepted.fl_str_mv 2017-12-01T00:00:00Z
datacite.date.available.fl_str_mv 2018-01-18T15:27:50Z
datacite.date.embargoed.fl_str_mv 2018-01-18T15:27:50Z
datacite.rights.fl_str_mv http://purl.org/coar/access_right/c_abf2
datacite.subjects.subject.fl_str_mv Parallel and Distributed Computing
Extraction of Relevant Expressions
Statistical n-gram Methods
Caching Strategies
datacite.titles.title.fl_str_mv Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora
dc.contributor.none.fl_str_mv Cunha, José
Silva, Joaquim
RUN
dc.creator.none.fl_str_mv Gonçalves, Carlos Jorge de Sousa
dc.date.Accepted.fl_str_mv 2017-12-01T00:00:00Z
dc.date.available.fl_str_mv 2018-01-18T15:27:50Z
dc.date.embargoed.fl_str_mv 2018-01-18T15:27:50Z
dc.format.none.fl_str_mv application/pdf
dc.identifier.none.fl_str_mv http://hdl.handle.net/10362/28488
dc.language.none.fl_str_mv eng
dc.rights.none.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.subject.none.fl_str_mv Parallel and Distributed Computing
Extraction of Relevant Expressions
Statistical n-gram Methods
Caching Strategies
dc.title.fl_str_mv Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora
dc.type.none.fl_str_mv http://purl.org/coar/resource_type/c_db06
dirty 0
eu_rights_str_mv openAccess
format doctoralThesis
fulltext.url.fl_str_mv https://run.unl.pt/bitstreams/c447279e-a5c4-4d26-b704-06b01a4dfdbd/download
id run_65a541e4c25c1cd7142490df69964dd3
identifier.url.fl_str_mv http://hdl.handle.net/10362/28488
instacron_str unl
institution Universidade Nova de Lisboa
instname_str Universidade Nova de Lisboa
language eng
network_acronym_str run
network_name_str Repositório Institucional da UNL
oai_identifier_str oai:run.unl.pt:10362/28488
organization_str_mv urn:organizationAcronym:unl
person_str_mv Gonçalves, Carlos Jorge de Sousa
publishDate 2017
reponame_str Repositório Institucional da UNL
repository_id_str urn:repositoryAcronym:run
service_str_mv urn:repositoryAcronym:run
spelling urn:tid:101577796
RUN (hosting institution), run@unl.pt
16917361 bytes
status SINGLETON
subject.fl_str_mv Parallel and Distributed Computing
Extraction of Relevant Expressions
Statistical n-gram Methods
Caching Strategies
title Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora
topic Parallel and Distributed Computing
Extraction of Relevant Expressions
Statistical n-gram Methods
Caching Strategies
topic_facet Parallel and Distributed Computing
Extraction of Relevant Expressions
Statistical n-gram Methods
Caching Strategies
url http://hdl.handle.net/10362/28488
visible 1