Publication

Exploration of machine learning and deep learning approaches to predicting enzyme cofactor binding

Bibliographic details
Abstract:	Enzymes may depend on cofactors to carry out essential biochemical reactions, making accurate cofactor prediction crucial for understanding metabolism and enhancing Genome-Scale Metabolic (GSM) models. Traditional homology-based approaches often struggle to capture the complexity and diversity of enzyme-cofactor interactions across different species. This dissertation presents a machine learning (ML) and deep learning (DL) workflow, employing Convolutional Neural Networks (CNNs), to predict enzyme cofactor binding from protein sequence embeddings such as ESM2, SeqVec, and FlashProt. The dataset, consisting of 73,307 protein sequences annotated with 13 cofactors, was obtained from key databases including UniProt, BRENDA, Rhea, and ChEBI. The CNN models, particularly those leveraging ESM2 embeddings, outperformed the others in multi-label classification tasks, achieving high accuracy, low Hamming loss, and high F1 scores. These models excelled in predicting cofactor binding for well-characterized organisms like Escherichia coli, for which extensive data are available. However, predictions for less-studied organisms, such as Helicobacter pylori and Synechocystis sp. PCC 6803, showed reduced accuracy, underscoring the need for more comprehensive species-specific data. In addition to the prediction workflow, a user-friendly web service was developed, allowing researchers to upload protein sequences and GSM models to receive cofactor predictions and update model reconstructions. This workflow provides a valuable tool for advancing metabolic research, enabling more accurate cofactor predictions and improving GSM model reconstructions. Future work may focus on expanding the dataset, integrating additional data types like protein structures and enzyme activity, and further exploring advanced DL techniques such as transformers to enhance performance and generalizability across species.
Main authors:	Gonçalves, Joana Oliveira
Subject:	Machine learning; Deep learning; Multi-label classification; Convolutional Neural Networks; Cofactors
Year:	2024
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Universidade do Minho
Language:	Portuguese
Source:	RepositóriUM - Universidade do Minho
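
The abstract reports multi-label performance in terms of accuracy, Hamming loss, and F1. As a minimal sketch of what these metrics measure, the Python below computes them on a tiny made-up example. The label matrices are invented for illustration, the function names are hypothetical, and reading "accuracy" as subset (exact-match) accuracy and "F1" as micro-averaged F1 are assumptions; the record does not say which variants the dissertation uses.

```python
from typing import List

Labels = List[List[int]]  # one row per protein, one 0/1 column per cofactor


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual protein/cofactor label slots predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def subset_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of proteins whose whole cofactor set is predicted exactly."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)


def micro_f1(y_true: Labels, y_pred: Labels) -> float:
    """F1 computed from true/false positives pooled over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Invented toy data: 3 proteins, 4 hypothetical cofactor labels.
    y_true = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 0]]
    y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0]]
    print("Hamming loss:   ", hamming_loss(y_true, y_pred))
    print("Subset accuracy:", subset_accuracy(y_true, y_pred))
    print("Micro F1:       ", micro_f1(y_true, y_pred))
```

Hamming loss counts each wrong label slot separately, so a protein with one missed cofactor out of several is penalized less than under subset accuracy, which only credits fully correct label sets.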

