This article presents a study based on 18th-century Portuguese texts, focusing on the analysis of named entities to enhance their value for historical research. To this end, an annotated corpus was developed from a primary source (the Parish Memories), which was transcribed, revised, and standardised. The distribution of named entities in the source was then analysed to reflect on the variations in the defined ca...
This article presents a study based on 18th-century Portuguese texts, through the analysis of named entities, with a view to enhancing their value for historical analysis. To this end, an annotated corpus was built from a source (Memórias Paroquiais) that was transcribed, revised, and normalised. Subsequently, the distribution of named entities in this source was analysed in order to reflect on ...
This paper discusses the impact of Portuguese variants in Large Language Models for the task of named entity recognition (NER) in specialised domains. The tests were run on a Brazilian Portuguese legal corpus and a European Portuguese historical corpus. The models considered are BERTimbau (PT-BR), Albertina (PT-PT and PT-BR), and XLM-R (multilingual). The impact was more evident in the Portuguese historical ...
Memórias Paroquiais-Alentejo (1758) collects the responses of the parish priests from the largest region of Portugal (Alentejo) to a survey carried out by the Crown, asking about the state of the territory and its populations, and also about the effects of the 1755 earthquake. This article discusses the transformation process from the manuscripts to a processable digital stage. We described some individual...
This paper presents the construction of a corpus and the respective models learned for the Named Entity Recognition (NER) task, specialised for historical research. The entity categories were adapted based on the objectives of the historical analysis of the 18th-century text. We trained and evaluated traditional neural networks and the new Large Language Models (LLMs) for the NER task. In total, we assessed six...
In Digital Humanities (DH), work based on textual sources varies widely: in the historical periods of the sources, in their medium (manuscripts on paper, printed, photographed, etc.), and in their stage of digitisation, which can range from digital images to PDF texts and texts digitised in other formats. All these variations add extra processing effort...
This paper presents a distribution analysis of named entities in a historical source, an 18th-century Portuguese text collection. The source has been transcribed, revised, normalised and annotated manually with the help of an annotation tool. The distribution analysis was carried out automatically with the help of an extraction parser applied to the annotated texts. The central question of this text is to analy...
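The distribution analysis described in the abstract above can be illustrated with a minimal sketch. It is not the authors' extraction parser: the inline tag format (`<PER>`, `<LOC>`, `<ORG>`), the function names, and the sample sentence are all assumptions made up for illustration, since the abstract does not state how the annotations are stored.

```python
import re
from collections import Counter

# Hypothetical inline annotation format (an assumption, not the source's format):
# <PER>...</PER>, <LOC>...</LOC>, <ORG>...</ORG>
TAG_PATTERN = re.compile(r"<(PER|LOC|ORG)>(.*?)</\1>", re.DOTALL)


def entity_distribution(annotated_text: str) -> Counter:
    """Count annotated entities per category."""
    return Counter(match.group(1) for match in TAG_PATTERN.finditer(annotated_text))


def relative_distribution(counts: Counter) -> dict:
    """Convert absolute counts into shares of all annotated entities."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Invented sample sentence, used only to exercise the functions.
sample = (
    "A freguesia de <LOC>Vila Exemplo</LOC> pertence ao <ORG>Bispado Exemplo</ORG>; "
    "o pároco é <PER>Fulano de Tal</PER>, natural de <LOC>Aldeia Exemplo</LOC>."
)
counts = entity_distribution(sample)
shares = relative_distribution(counts)
```

Comparing `shares` across documents or parishes is one simple way to look at variation between categories, under the assumptions stated above.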
This paper reflects on the whole path of work in digital humanities, in the light of the projects related to text processing under development at CIDEHUS. These projects deal with a rich heritage related to the Portuguese culture, history and language. This paper reflects on the many challenges to be faced and how NLP techniques may broaden the capabilities of organising and sharing knowledge related to these r...
This paper reviews a stage of the process of annotating named entities in 18th-century texts to enrich historical research sources and link them to other bases. The categories in question are person, location and organisation, which are relevant categories for historical analysis. We discuss the difficulties observed in the process and point to possible solutions. Partially supported by the Portuguese Foundation FCT, under the...
This work presents an enriched version of the Parish Memories (1758–1761), an essential Portuguese historical source manually transcribed. It is enriched with annotations of named entities of the types PERSON, LOCATION, and ORGANIZATION. The whole collection was annotated automatically, and two researchers annotated a portion of it manually for evaluation purposes. In this dataset, we provide the...
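As a rough illustration of how automatic annotations can be scored against a manually annotated portion, the sketch below computes entity-level precision, recall and F1 with exact span-and-label matching. The abstract does not say which metric or matching scheme was used, so the `Span` type, the function name, and the toy spans are assumptions for illustration only.

```python
from typing import NamedTuple, Set, Tuple


class Span(NamedTuple):
    """A hypothetical entity span: character offsets plus a category label."""
    start: int
    end: int
    label: str


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match scores: a prediction counts only if offsets and label both agree."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy spans, invented for illustration (not taken from the dataset).
gold = {Span(0, 5, "PERSON"), Span(10, 16, "LOCATION"), Span(20, 30, "ORGANIZATION")}
pred = {Span(0, 5, "PERSON"), Span(10, 16, "ORGANIZATION")}
precision, recall, f1 = precision_recall_f1(gold, pred)
```

Exact matching is the strictest choice; a span with the right offsets but the wrong label, as in the second toy prediction, counts as both a false positive and a missed gold entity.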