Document details

Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora

Author(s): Gonçalves, Carlos Jorge de Sousa

Date: 2017

Persistent ID: http://hdl.handle.net/10362/28488

Origin: Repositório Institucional da UNL

Subject(s): Parallel and Distributed Computing; Extraction of Relevant Expressions; Statistical n-gram Methods; Caching Strategies; Scientific Domain/Area::Engineering and Technology::Electrical Engineering, Electronics and Informatics

Description

The amount of information available through the Internet has grown significantly in the last decade. This information comes from various sources, such as scientific experiments in particle acceleration, flight data recorded from commercial aircraft, or sets of documents from a given domain, such as medical articles, newspaper headlines, or social network content. Due to the volume of data that must be analyzed, search engines need new tools that allow the user to obtain the desired information in a timely and accurate manner. One approach is to annotate documents with their relevant expressions. Relevant expressions can be extracted from natural language text documents using semantic, syntactic, or statistical techniques. Although statistical techniques tend to be less accurate, they have the advantage of being language-independent. This investigation was carried out in the context of LocalMaxs, a statistical, and thus language-independent, method capable of extracting relevant expressions from natural language corpora. However, due to the large volume of data involved, sequential implementations of these techniques are severely limited in both execution time and memory space. In this thesis we propose a distributed architecture and strategies for parallel implementations of statistical-based extraction of relevant expressions from large corpora. A methodology was developed for modeling and evaluating those strategies, based on empirical and theoretical approaches to estimating the statistical distribution of n-grams in natural language corpora. These approaches were applied to guide the design and evaluation of the behavior of parallel and distributed LocalMaxs implementations on cluster and cloud computing platforms.
The implementation alternatives were compared regarding their precision and recall, and their performance metrics, namely execution time, parallel speedup, and sizeup. The performance results indicate almost linear speedup and sizeup for the range of large corpus sizes.

Document Type: Doctoral thesis
Language: English
Advisor(s): Cunha, José; Silva, Joaquim
Contributor(s): RUN

Related documents

A theoretical model for n-gram distribution in big data corpora

Autonomic workflow activities: the award framework