Publication

SAQL: query language for corpora with morpho-syntactic annotation

Bibliographic details
Abstract:	Computer-mediated communication becomes more prevalent with each passing day, be it in social media, blogs or forums. These media gather large numbers of people from different backgrounds and provide places where opposing ideals can clash. Such clashes can devolve into attacks that resort to inappropriate language and, in more extreme cases, hate speech. Detecting these cases is difficult, both because of the large amount of data posted online and because of the language itself, whose various idiosyncrasies hinder automatic classification efforts. The aim of this thesis was to develop a system capable of processing texts and of identifying and annotating within them certain syntactic patterns typically present in hate speech. This main purpose can be split into two goals: the morpho-syntactic annotation of online texts, together with a query engine to search for patterns present in the corpus; and the identification and classification of occurrences of hate speech in an online medium. As a case study, the corpus extracted from online platforms by the NetLang Project was used. To fulfill these goals, a pre-processing system was implemented, and the resulting annotations feed both the classification system and the query system. The hate speech classification system was developed using a mixed methodology, applying manual linguistic analysis to the results of the automatic methods in order to classify instances of hate speech. The system was tested and the results were compared with the statistical classification. The work on the query system consisted of formulating the query language and creating the corresponding query engine, which allows users to search the annotated corpus for particular sequences in the texts. To evaluate the usability of the query engine, an experiment was carried out to gather feedback from prospective end users.
Main authors:	Pereira, Ana Filipa Vilela
Subject:	Computer mediated communication; Hate speech classification; Morpho-syntactic annotation; Natural language processing
Year:	2022
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Universidade do Minho
Language:	Spanish
Source:	RepositóriUM - Universidade do Minho
