Publicação

Time-aware focused web crawling

Ver documento

Detalhes bibliográficos
Resumo:There is a plethora of information inside the Web. Even the top commercial search engines can not download and index all the available information. So, in the recent years, there are several research works on the design and implementation of focused topic crawlers and also on geographic scope crawlers. Despite other areas of information retrieval, research on Web crawling is not using the temporal information extracted from Web pages in the used crawling criteria. Therefore, our research challenge is the use of temporal data extracted from Web pages as the main crawling criteria to satisfy a given temporal focus. The importance of the time dimension is quite amplified when combined with topic or geography, but now we want to study it isolated. The used approach is based on temporal segmentation of Web pages text. It only follows links within segments tagged with dates in the scope of restriction. A precision around 75% was achieved in preliminary experimental results.
Autores principais:Pereira, Pedro
Outros Autores:Macedo, Joaquim; Craveiro, Olga; Madeira, Henrique
Assunto:Web crawling Temporal text segmentation Temporal information extraction Temporal information retrieval
Ano:2014
País:Portugal
Tipo de documento:comunicação em conferência
Tipo de acesso:acesso restrito
Instituição associada:Universidade do Minho
Idioma:inglês
Origem:RepositóriUM - Universidade do Minho
Descrição
Resumo:There is a plethora of information inside the Web. Even the top commercial search engines can not download and index all the available information. So, in the recent years, there are several research works on the design and implementation of focused topic crawlers and also on geographic scope crawlers. Despite other areas of information retrieval, research on Web crawling is not using the temporal information extracted from Web pages in the used crawling criteria. Therefore, our research challenge is the use of temporal data extracted from Web pages as the main crawling criteria to satisfy a given temporal focus. The importance of the time dimension is quite amplified when combined with topic or geography, but now we want to study it isolated. The used approach is based on temporal segmentation of Web pages text. It only follows links within segments tagged with dates in the scope of restriction. A precision around 75% was achieved in preliminary experimental results.