Publication

Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora

Bibliographic details
Abstract:	The amount of information available through the Internet has grown significantly in the last decade. This information comes from various sources, such as scientific experiments in particle acceleration, flight data recorded from commercial aircraft, or sets of documents from a given domain, such as medical articles, newspaper headlines, or social network content. Given the volume of data that must be analyzed, search engines need new tools that allow users to obtain the desired information in a timely and accurate manner. One approach is to annotate documents with their relevant expressions. Relevant expressions can be extracted from natural language text documents using semantic, syntactic, or statistical techniques. Although statistical techniques tend to be less accurate, they have the advantage of being language-independent. This investigation was carried out in the context of LocalMaxs, a statistical and therefore language-independent method capable of extracting relevant expressions from natural language corpora. However, due to the large volume of data involved, sequential implementations of these techniques are severely limited in both execution time and memory space. In this thesis we propose a distributed architecture and strategies for parallel implementations of statistical-based extraction of relevant expressions from large corpora. A methodology was developed for modeling and evaluating those strategies, based on empirical and theoretical approaches to estimating the statistical distribution of n-grams in natural language corpora. These approaches were applied to guide the design and evaluation of the behavior of parallel and distributed LocalMaxs implementations on cluster and cloud computing platforms.
The implementation alternatives were compared with respect to their precision and recall and to their performance metrics, namely execution time, parallel speedup, and sizeup. The performance results indicate almost linear speedup and sizeup for the range of large corpus sizes.
Main authors:	Gonçalves, Carlos Jorge de Sousa
Subject:	Parallel and Distributed Computing; Extraction of Relevant Expressions; Statistical n-gram Methods; Caching Strategies
Year:	2017
Country:	Portugal
Document type:	doctoral thesis
Access type:	open access
Associated institution:	Universidade Nova de Lisboa
Language:	English
Source:	Repositório Institucional da UNL

_version_	1868414995671285760
author	Gonçalves, Carlos Jorge de Sousa
author_facet	Gonçalves, Carlos Jorge de Sousa
author_role	author
contributor_name_str_mv	Cunha, José; Silva, Joaquim; RUN
country_str	PT
creators_json_txt	[{"Person.name":"Gonçalves, Carlos Jorge de Sousa"}]
datacite.contributors.contributor.contributorName.fl_str_mv	Cunha, José; Silva, Joaquim; RUN
datacite.creators.creator.creatorName.fl_str_mv	Gonçalves, Carlos Jorge de Sousa
datacite.date.Accepted.fl_str_mv	2017-12-01T00:00:00Z
datacite.date.available.fl_str_mv	2018-01-18T15:27:50Z
datacite.date.embargoed.fl_str_mv	2018-01-18T15:27:50Z
datacite.rights.fl_str_mv	http://purl.org/coar/access_right/c_abf2
datacite.subjects.subject.fl_str_mv	Parallel and Distributed Computing; Extraction of Relevant Expressions; Statistical n-gram Methods; Caching Strategies
datacite.titles.title.fl_str_mv	Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora
dc.contributor.none.fl_str_mv	Cunha, José; Silva, Joaquim; RUN
dc.creator.none.fl_str_mv	Gonçalves, Carlos Jorge de Sousa
dc.date.Accepted.fl_str_mv	2017-12-01T00:00:00Z
dc.date.available.fl_str_mv	2018-01-18T15:27:50Z
dc.date.embargoed.fl_str_mv	2018-01-18T15:27:50Z
dc.format.none.fl_str_mv	application/pdf
dc.identifier.none.fl_str_mv	http://hdl.handle.net/10362/28488
dc.language.none.fl_str_mv	eng
dc.rights.none.fl_str_mv	http://purl.org/coar/access_right/c_abf2
dc.subject.none.fl_str_mv	Parallel and Distributed Computing; Extraction of Relevant Expressions; Statistical n-gram Methods; Caching Strategies
dc.title.fl_str_mv	Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora
dc.type.none.fl_str_mv	http://purl.org/coar/resource_type/c_db06
dirty	0
eu_rights_str_mv	openAccess
format	doctoralThesis
fulltext.url.fl_str_mv	https://run.unl.pt/bitstreams/c447279e-a5c4-4d26-b704-06b01a4dfdbd/download
id	run_65a541e4c25c1cd7142490df69964dd3
identifier.url.fl_str_mv	http://hdl.handle.net/10362/28488
instacron_str	unl
institution	Universidade Nova de Lisboa
instname_str	Universidade Nova de Lisboa
language	eng
network_acronym_str	run
network_name_str	Repositório Institucional da UNL
oai_identifier_str	oai:run.unl.pt:10362/28488
organization_str_mv	urn:organizationAcronym:unl
person_str_mv	Gonçalves, Carlos Jorge de Sousa
publishDate	2017
reponame_str	Repositório Institucional da UNL
repository_id_str	urn:repositoryAcronym:run
service_str_mv	urn:repositoryAcronym:run
status	SINGLETON
subject.fl_str_mv	Parallel and Distributed Computing; Extraction of Relevant Expressions; Statistical n-gram Methods; Caching Strategies
title	Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora
topic	Parallel and Distributed Computing; Extraction of Relevant Expressions; Statistical n-gram Methods; Caching Strategies
topic_facet	Parallel and Distributed Computing; Extraction of Relevant Expressions; Statistical n-gram Methods; Caching Strategies
url	http://hdl.handle.net/10362/28488
visible	1
