Publication

Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems

Bibliographic details
Abstract: Continuous social and economic development has led over time to an increase in consumption, as well as to greater consumer demand for better and cheaper products. Hence, the selling price of a product plays a fundamental role in the consumer's purchase decision. In this context, online stores must carefully analyse and define the best price for each product, based on several factors such as production/acquisition cost, the positioning of the product (e.g. anchor product) and competitors' strategies. The work done by market analysts has changed drastically over the last few years. As the number of Web sites increases exponentially, the number of E-commerce Web sites also grows. Web page classification is becoming more important in fields like Web mining and information retrieval. Traditional classifiers are usually hand-crafted and non-adaptive, which makes them unsuitable for use in a broader context. We introduce an ensemble of methods, together with a subsequent study of its results, to create a more generic and modular crawler and scraper for detection and information extraction on E-commerce Web pages. The collected information may then be processed and used in the pricing decision. This framework is named Prometheus, and its goal is to extract knowledge from E-commerce Web sites. The process requires crawling an online store and gathering product pages. This implies that, given a Web page, the framework must be able to determine whether it is a product page. To achieve this, we classify pages into three categories: catalogue, product and "spam". The page classification stage is based on the HTML text as well as on the visual layout, and features both traditional methods and Deep Learning approaches. Once a set of product pages has been identified, we proceed to extract the pricing information. This is not a trivial task, given the wide variety of ways in which a Web page can be built.
Furthermore, most product pages are dynamic, in the sense that a single page actually serves a family of related products. For instance, a particular shoe model in a shoe store is probably available in a number of sizes and colours. Such a model may be displayed in a single dynamic Web page, making it necessary for our framework to explore all the relevant combinations. This process is called scraping and is the last stage of the Prometheus framework.
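The need to "explore all the relevant combinations" on a dynamic product page can be illustrated with a minimal sketch. This is not the method used in the dissertation; it only assumes that the option groups of a page (here the hypothetical `size` and `colour` lists) have already been extracted, and enumerates their Cartesian product so that each variant can be visited once. The function name `variant_combinations` and the sample values are assumptions made for illustration.

```python
from itertools import product


def variant_combinations(options):
    """Yield one dict per combination of option values.

    `options` maps an option name (e.g. "size") to the list of values
    offered on the page. The structure is hypothetical, not taken from
    the Prometheus framework itself.
    """
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical option groups for a single shoe model.
shoe_options = {"size": [40, 41, 42], "colour": ["black", "brown"]}

combos = list(variant_combinations(shoe_options))
for combo in combos:
    # A real scraper would select this variant on the page and read its price.
    print(combo)
```

With three sizes and two colours, the sketch yields six variants; the number of page states to explore grows multiplicatively with each additional option group.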
Main authors: Dias, João Tiago Pereira
Subject: E-commerce; Web mining; Web page classification; Machine learning; Crawler; Scraper; Natural Sciences::Computer and Information Sciences
Year: 2019
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade do Minho
Language: English
Source: RepositóriUM - Universidade do Minho
_version_ 1867438755323838464
author Dias, João Tiago Pereira
author_facet Dias, João Tiago Pereira
author_role author
contributor_name_str_mv Fernandes, António Ramires
RepositóriUM - Universidade do Minho
country_str PT
creators_json_txt [{\"Person.name\":\"Dias, João Tiago Pereira\"}]
datacite.contributors.contributor.contributorName.fl_str_mv Fernandes, António Ramires
RepositóriUM - Universidade do Minho
datacite.creators.creator.creatorName.fl_str_mv Dias, João Tiago Pereira
datacite.date.Accepted.fl_str_mv 2019-01-01T00:00:00Z
datacite.date.available.fl_str_mv 2020-08-25T14:39:14Z
datacite.date.embargoed.fl_str_mv 2020-08-25T14:39:14Z
datacite.rights.fl_str_mv http://purl.org/coar/access_right/c_abf2
datacite.subjects.subject.fl_str_mv E-commerce
Web mining
Web page classification
Machine learning
Crawler
Scraper
Ciências Naturais::Ciências da Computação e da Informação
datacite.titles.title.fl_str_mv Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
dc.contributor.none.fl_str_mv Fernandes, António Ramires
RepositóriUM - Universidade do Minho
dc.creator.none.fl_str_mv Dias, João Tiago Pereira
dc.date.Accepted.fl_str_mv 2019-01-01T00:00:00Z
dc.date.available.fl_str_mv 2020-08-25T14:39:14Z
dc.date.embargoed.fl_str_mv 2020-08-25T14:39:14Z
dc.format.none.fl_str_mv application/pdf
dc.identifier.none.fl_str_mv https://hdl.handle.net/1822/66581
dc.language.none.fl_str_mv eng
dc.rights.none.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.subject.none.fl_str_mv E-commerce
Web mining
Web page classification
Machine learning
Crawler
Scraper
Ciências Naturais::Ciências da Computação e da Informação
dc.title.fl_str_mv Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
dc.type.none.fl_str_mv http://purl.org/coar/resource_type/c_bdcc
dirty 0
eu_rights_str_mv openAccess
format masterThesis
fulltext.url.fl_str_mv https://repositorium.uminho.pt/bitstreams/c3a8e7d3-9a1a-40c9-9127-d01c3bce6204/download
id rum_ec4ccb89ec8ad661069dcbd91d984c08
identifier.url.fl_str_mv https://hdl.handle.net/1822/66581
instacron_str repositorium
institution Universidade do Minho
instname_str Universidade do Minho
language eng
network_acronym_str rum
network_name_str RepositóriUM - Universidade do Minho
oai_identifier_str oai:repositorium.uminho.pt:1822/66581
organization_str_mv urn:organizationAcronym:repositorium
person_str_mv Dias, João Tiago Pereira
publishDate 2019
reponame_str RepositóriUM - Universidade do Minho
repository_id_str urn:repositoryAcronym:rum
service_str_mv urn:repositoryAcronym:rum
spelling eng
por
application/pdf
por
Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
Dias, João Tiago Pereira
Fernandes, António Ramires
HostingInstitution
Organizational
RepositóriUM - Universidade do Minho
e-mail
mailto:repositorium@usdb.uminho.pt
repositorium@usdb.uminho.pt
TID
202503275
2020-08-25T14:39:14Z
2019
2019
2019-01-01T00:00:00Z
Handle
https://hdl.handle.net/1822/66581
http://purl.org/coar/access_right/c_abf2
open access
E-commerce
Web mining
Web page classification
Machine learning
Crawler
Scraper
http://www.oecd.org/science/inno/38235147.pdf
Fields of Science and Technology (FOS)
Ciências Naturais::Ciências da Computação e da Informação
23762107 bytes
literature
http://purl.org/coar/resource_type/c_bdcc
master thesis
http://purl.org/coar/access_right/c_abf2
application/pdf
fulltext
https://repositorium.uminho.pt/bitstreams/c3a8e7d3-9a1a-40c9-9127-d01c3bce6204/download
spellingShingle Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
Dias, João Tiago Pereira
E-commerce
Web mining
Web page classification
Machine learning
Crawler
Scraper
Ciências Naturais::Ciências da Computação e da Informação
status SINGLETON
subject.fl_str_mv E-commerce
Web mining
Web page classification
Machine learning
Crawler
Scraper
subject.other.fl_str_mv Ciências Naturais::Ciências da Computação e da Informação
title Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
title_full Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
title_fullStr Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
title_full_unstemmed Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
title_short Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
title_sort Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
topic E-commerce
Web mining
Web page classification
Machine learning
Crawler
Scraper
Ciências Naturais::Ciências da Computação e da Informação
topic_facet E-commerce
Web mining
Web page classification
Machine learning
Crawler
Scraper
Ciências Naturais::Ciências da Computação e da Informação
url https://hdl.handle.net/1822/66581
visible 1