Publication

R-seqQI: RNA-Seq Quality Indicator

Bibliographic details
Abstract:	Current progress in sequencing systems facilitates the sequencing of the genomes and transcriptomes of countless organisms. However, measuring the quality of the processed data is not simple, especially in the study of non-model organisms, for which there is little, if any, information available. The Korf Lab developed a method for evaluating genome integrity through the identification of 248 core eukaryotic genes (CEGs) that are present in nearly all eukaryotes. The main goal of this work is to evaluate the use of the CEGs in RNA-Seq of non-model organisms. To that end, two software tools were developed: seqQIrefmetrics, which calculates a set of reference-based quality metrics, including four taken from the literature (identification, chimerism, accuracy and contiguity) and three new ones (fragmentation (1, 2, 3, 4, 5+), coverage and non-match), increasing the number of metrics available for transcriptome quality assessment; and seqQIidentifyCEGs, which identifies and reports the number of CEGs present in each transcriptome assembly. To carry out the main objective, RNA-Seq data from nine model organisms (Arabidopsis thaliana, Aspergillus nidulans, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Mus musculus, Oryza sativa, Saccharomyces cerevisiae and Xenopus tropicalis), assembled with Trinity, were used to evaluate how CEG detection correlates with the quality of the transcriptomes. To identify CEGs, protein sequences were predicted from the assembled transcripts with TransDecoder. The metrics calculated by seqQIrefmetrics were related, through linear regressions, to the number of CEGs identified by seqQIidentifyCEGs in each assembled transcriptome. Among these metrics, only contiguity and coverage were used to create predictive models, achieving an R² of 0.787 and 0.640 and an RMSE of 5.86 and 6.90, respectively. These findings indicate that the CEGs can be used as a quality tool.
In fact, the linear regressions make it possible to infer the quality of the assembled transcripts prospectively, without additional information such as a reference genome sequence or structural annotations. This approach is particularly important for RNA-Seq of non-model organisms, where no such information is available to evaluate the quality of the assembled transcripts reliably.
Main authors:	Sousa, Abel Ernesto Fernandes de
Subject:	Engineering and Technology::Electrical, Electronic and Information Engineering
Year:	2016
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Universidade do Minho
Language:	English
Source:	RepositóriUM - Universidade do Minho
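
The abstract reports linear regressions evaluated by R² and RMSE. As a rough illustration of how those two figures are obtained from a simple one-predictor fit, here is a minimal Python sketch. It is not the dissertation's code: the function names, the choice of CEG count as predictor and a quality metric as response, and the sample values are all assumptions made for illustration only.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def rmse(y, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred)) / len(y))


if __name__ == "__main__":
    # Hypothetical values, NOT data from the dissertation:
    # number of CEGs found per assembly, and a quality metric in percent.
    ceg_counts = [120, 160, 190, 215, 240]
    metric_pct = [41.0, 52.5, 63.0, 66.5, 78.0]

    slope, intercept = fit_line(ceg_counts, metric_pct)
    predicted = [slope * c + intercept for c in ceg_counts]
    print("slope:", slope, "intercept:", intercept)
    print("R^2:", r_squared(metric_pct, predicted))
    print("RMSE:", rmse(metric_pct, predicted))
```

With a fitted line of this kind, the number of CEGs detected in a new assembly could be plugged in to estimate the metric without a reference genome, which is the prospective use the abstract describes.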
