Detalhes do Documento

Developing reliability metrics and validation tools for datasets with deep linguistic information

Autor(es): Castro, Sérgio Ricardo de

Data: 2011

Identificador Persistente: http://hdl.handle.net/10451/13908

Origem: Repositório da Universidade de Lisboa

Assunto(s): Natural language processing; corpora annotation with deep linguistic information; inter-annotator agreement


Descrição

The purpose of this dissertation is to propose a reliability metric and respective validation tools for corpora annotated with deep linguistic information. The annotation of corpus with deep linguistic information is a complex task, and therefore is aided by a computational grammar. This grammar generates all the possible grammatical representations for sentences. The human annotators select the most correct analysis for each sentence, or reject it if no suitable representation is achieved. This task is repeated by two human annotators under a double-blind annotation scheme and the resulting annotations are adjudicated by a third annotator. This process should result in reliable datasets since the main purpose of this dataset is to be the training and validation data for other natural language processing tools. Therefore it is necessary to have a metric that assures such reliability and quality. In most cases, the metrics uses for shallow annotation or parser evaluation have been used for this same task. However the increased complexity demands a better granularity in order to properly measure the reliability of the dataset. With that in mind, I suggest the usage of a metric based on the Cohen’s Kappa metric that instead of considering the assignment of tags to parts of the sentence, considers the decision at the level of the semantic discriminants, the most granular unit available for this task. By comparing each annotator’s options it is possible to evaluate with a high degree of granularity how close their analysis were for any given sentence. An application was developed that allowed the application of this model to the data resulting from the annotation process which was aided by the LOGON framework. The output of this application not only has the metric for the annotated dataset, but some information related with divergent decision with the intent of aiding the adjudication process.

Tipo de Documento Dissertação de mestrado
Idioma Inglês
Orientador(es) Branco, António
Contribuidor(es) Repositório da Universidade de Lisboa
facebook logo  linkedin logo  twitter logo 
mendeley logo

Documentos Relacionados

Não existem documentos relacionados.