Document details

Time and Space Efficient Data Structures for Supporting Machine Translation Tasks

Author(s): Costa, Jorge André Nogueira da

Date: 2017

Persistent ID: http://hdl.handle.net/10362/28929

Origin: Repositório Institucional da UNL

Subject(s): Bilingual texts; Byte-codes Wavelet Tree; Suffix Array; Machine Translation; Bilingual Framework; bilingual search; Scientific Domain/Area::Engineering and Technology::Electrical Engineering, Electronics and Informatics


Description

The amount of digital natural language text available today is huge and has been growing at an exponential rate. All this information can be easily accessed by people of many nationalities and cultures. This has led to the development of new techniques and tools for processing and indexing these texts, in research fields such as Machine Translation, Natural Language Processing and Cross-Language Information Retrieval. Over the years, much important work has relied on efficient data structures, such as suffix arrays, for fast pattern matching and for computing statistics. However, these data structures require a considerable amount of space, around four times the text size, which is a problem given the amount of bilingual text available in so many languages. This thesis proposal introduces a two-layer bilingual framework based on compact data structures for indexing parallel texts, translation memories and bilingual lexica, together with their alignments, for pairs of languages. Besides a word-based suffix array implementation, this thesis proposal presents a solution based on two byte-codes wavelet trees, one for each text, and bitmaps to represent the alignment. Additionally, it introduces a skip-based bilingual search procedure that speeds up the framework's search response time for operations over pairs of words, multi-word phrases or discontiguous phrases. For indexing and querying aligned parallel corpora, the bilingual framework uses around 50% of the size of the alignment-annotated corpora, against 160% for the non-compressed approach. In terms of search response time, the compressed approach is, as expected, slower than the one based on suffix arrays. The skip-based bilingual search procedure improves on the response time of the original bilingual search algorithm by a factor of 1.6 to 2.3 on average.
With such space requirements, the framework is able to represent huge amounts of data in main memory, avoiding the considerably slower disk accesses, and to support tasks such as translation, text alignment, word-sense disambiguation or context analysis.
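
As a rough illustration of the word-based suffix array mentioned in the description, the sketch below indexes a tokenized text by word positions and finds phrase occurrences with binary search. It is not the thesis implementation: the function names, the whitespace tokenization and the naive sort-based construction are assumptions made for this example, and no compression is applied.

```python
def build_word_suffix_array(words):
    """Return the start positions of all word-level suffixes, sorted
    lexicographically. Naive construction, for illustration only."""
    return sorted(range(len(words)), key=lambda i: words[i:])


def find_phrase(words, sa, phrase):
    """Return the sorted text positions (word offsets) where `phrase`,
    given as a list of words, occurs in `words`."""
    m = len(phrase)

    # Lower bound: first suffix whose first m words are >= phrase.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if words[sa[mid]:sa[mid] + m] < phrase:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m words are > phrase.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if words[sa[mid]:sa[mid] + m] <= phrase:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the phrase.
    return sorted(sa[start:lo])


# Hypothetical sample text, split on whitespace.
text = "the cat sat on the mat the cat ran".split()
sa = build_word_suffix_array(text)
print(find_phrase(text, sa, ["the", "cat"]))  # [0, 6]
```

Because the array stores one integer per word rather than per character, it is smaller than a character-level suffix array, but it still sits alongside the text itself, which is the kind of overhead the compact wavelet-tree representation aims to reduce.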

Document Type: Doctoral thesis
Language: English
Advisor(s): Lopes, José; Russo, Luís
Contributor(s): Costa, Jorge André Nogueira da
