Publication

Development of a computational approach for the identification and annotation of transport proteins

Bibliographic details
Abstract: In the last decade, with the evolution of next-generation sequencing techniques, the number of sequenced genomes has grown exponentially [2]. The framework merlin [1], developed by the Biosystems research group (University of Minho), is a tool capable of generating genome-scale metabolic models. Identifying the genes that encode transport proteins, and the metabolites those proteins transport, is essential for developing more robust and accurate genome-scale metabolic models. In this work, seven different machine learning models were trained and tested on different datasets, using five-fold cross-validation, to identify and classify transport proteins. To demonstrate the value of the developed models, four different datasets composed of well-annotated proteins from TCDB and SwissProt were used. Ensembles of the models created using different datasets showed good overall performance, with accuracy reaching 91% and low standard error and F1 scores reaching 0.90 (± 0.00), making them a good solution for identifying and characterizing transport proteins in a new, unannotated genome. The models used to identify transport proteins produced almost three times as many false negatives as false positives. This means that confidence is high when a protein is classified as a transporter, but that these models miss a considerable number of transport proteins, which are misclassified as non-transporters.
Main author: Faria, Daniel Torres Varzim
Subject: Machine learning; Transport proteins; Models; Characterization
Year: 2016
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade do Minho
Language: English
Source: RepositóriUM - Universidade do Minho
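
The abstract's closing remark, that many more false negatives than false positives implies confident positive calls but missed transporters, can be illustrated with a minimal Python sketch. The counts below are hypothetical, chosen only so that false negatives are three times the false positives; they are not results from the dissertation, and the function name is an assumption made for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted transporters that are real
    recall = tp / (tp + fn)     # share of real transporters that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts (NOT from the dissertation): FN is three times FP.
tp, fp, fn = 80, 5, 15
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With few false positives, precision stays high, so a protein labeled as a transporter is likely to be one. The larger number of false negatives lowers recall, which is the "missed transporters" effect the abstract describes.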