Document details

Declarative approach to data extraction of web pages

Author(s): Alves, Ricardo cv logo 1

Date: 2009

Persistent ID: http://hdl.handle.net/10362/5822

Origin: Repositório Institucional da UNL

Subject(s): Web wrappers; Web data extraction; Graphical wrapper construction


Description
In the last few years, we have been witnessing a noticeable WEB evolution with the introduction of significant improvements at technological level, such as the emergence of XHTML, CSS,Javascript, and Web2.0, just to name ones. This, combined with other factors such as physical expansion of the Web, as well as its low cost, have been the great motivator for the organizations and the general public to join, with a consequent growth in the number of users and thus influencing the volume of the largest global data repository. In consequence, there was an increasing need for regular data acquisition from the WEB, and because of its frequency, length or complexity, it would only be viable to obtain through automatic extractors. However, two main difficulties are inherent to automatic extractors. First, much of the Web's information is presented in visual formats mainly directed for human reading. Secondly, the introduction of dynamic webpages, which are brought together in local memory from different sources, causing some pages not to have a source file. Therefore, this thesis proposes a new and more modern extractor, capable of supporting the Web evolution, as well as being generic, so as to be able to be used in any situation, and capable of being extended and easily adaptable to a more particular use. This project is an extension of an earlier one which had the capability of extractions on semi-structured text files. However it evolved to a modular extraction system capable of extracting data from webpages, semi-structured text files and be expanded to support other data source types. It also contains a more complete and generic validation system and a new data delivery system capable of performing the earlier deliveries as well as new generic ones. A graphical editor was also developed to support the extraction system features and to allow a domain expert without computer knowledge to create extractions with only a few simple and intuitive interactions on the rendered webpage. Thesis submitted to Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa, in partial fulfilment of the requirements for the degree of Master in Computer Science
Document Type Master Thesis
Language English
Advisor(s) Pires, João
delicious logo  facebook logo  linkedin logo  twitter logo 
degois logo
mendeley logo

Related documents