Publication
Clickstream data warehousing for web crawlers profiling
| Abstract: | Web sites routinely monitor visitor traffic as a useful measure of their overall success. However, simple summaries such as the total number of visits per month provide little insight into individual site patterns, especially in a changing environment like the Web. This paper describes an approach to usage profiling based on clickstream data collected from several Web server sites and stored in a specialized clickstream data warehouse. We aim to provide valuable insights about common users, while also preventing unauthorised access to content and any form of overload that might degrade site performance. Common crawler detection heuristics help to classify sessions, enabling the construction of site-specific profile training sets. Classification algorithms are then used to build predictive models that can evaluate unseen sessions, namely their nature and potential hazard to the site, while they are still ongoing. |
|---|---|
| Main authors: | Lourenço, Anália |
| Other authors: | Belo, Orlando |
| Subject: | Data warehousing; Clickstream data; Web housing; Web usage mining; Web crawler profiling |
| Year: | 2011 |
| Country: | Portugal |
| Document type: | conference paper |
| Access type: | restricted access |
| Associated institution: | Universidade do Minho |
| Language: | English |
| Source: | RepositóriUM - Universidade do Minho |