Publication

End-to-End Pipeline For Analysing Media Coverage of Corruption in Portugal

Bibliographic details
Abstract: In this work, we propose a data pipeline for the collection and analysis of news articles on corruption and connected criminality from the Portuguese media outlets collaborating with our project. Our approach uses media text to support analyses of the perception of corruption, which until now have relied on public questionnaires and expert scores [1]. Concretely, we use the Mediacloud API [2], through its Portugal - National geographical collection, together with web scraping techniques to construct a relational database of 18119 articles from a set of 14 Portuguese news sources covering the period 01/01/2015 - 31/12/2020. Two pre-trained named entity recognition taggers were compared in an internal manual annotation task, from which we chose one model to automatically extract information from the collected articles, namely the categories of Second HAREM’s selective scenario [3]: organizations, persons, locations, dates and values. Furthermore, we present three research avenues aimed at extracting insights from the database and improving its usability:
• gauging the intensity and quality (founded vs unfounded claims) of corruption cases in local municipalities, as presented by the Portuguese media, following the data collection of [4];
• enabling case studies by retrieving salient operations/cases from articles’ titles;
• aggregating similar news articles by experimenting with topic modelling techniques, namely LDA [5] and Top2Vec [6].
Main authors: Marques, Afonso Manuel Cunha
Subject: Corruption; Media; Big Data; Local Governance
Year: 2022
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade Nova de Lisboa
Language: English
Source: Repositório Institucional da UNL