Publication
Ukraine War: an online corpus to analyze the impact of the war in Ukraine
| Abstract: | This document reports a Master’s project, the final project of the 5th year of the Integrated Master’s in Informatics Engineering, carried out at Universidade do Minho in Braga, Portugal. On February 24, 2022, a conflict between Ukraine and Russia began. The war is devastating and affects many people, both residents of the countries directly involved and those of neighboring countries. As a highly significant event, it receives coverage from many sources worldwide, including traditional print newspapers, online news platforms, social networks, blogs, and television programs. However, this information is scattered across different websites and social networks, which makes analysis very difficult for researchers (in Linguistics, History, the Humanities, etc.) and other interested readers. It is therefore essential to gather the information on a single platform. This work aims to create an online corpus in Portuguese about the Ukraine War, based on news from Portuguese online newspapers and on comments from social media. To that end, a variety of news sources were first considered, and the Portuguese online newspapers “Público” and “Jornal de Negócios” were selected, together with the platform “Reddit”. The required information was obtained through Web Scraping: for each source, an extractor was developed that collected the necessary information and saved it in a JSON file. Natural Language Processing techniques were then used to process the gathered information, which was afterward stored in MongoDB, a non-relational database. Finally, a website called GUCO was designed and implemented, allowing users to navigate and explore the resulting corpus. The GUCO website is available at https://guco.epl.di.uminho.pt/. |
|---|---|
| Main authors: | Rosendo, Ana Rita Miranda |
| Subject: | Online corpus; Ukraine war; Rebuild the war through the news; Social network analysis; Web scraping; Natural language processing; Corpus online; Guerra da Ucrânia; Reconstrução da guerra através das notícias; Análise de redes sociais; Web scraping; Processamento de linguagem natural |
| Year: | 2024 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade do Minho |
| Language: | English |
| Source: | RepositóriUM - Universidade do Minho |