Publication
Exploration of machine learning and deep learning approaches to predicting enzyme cofactor binding
| Abstract: | Enzymes may depend on cofactors to carry out essential biochemical reactions, making accurate cofactor prediction crucial for understanding metabolism and enhancing Genome-Scale Metabolic (GSM) models. Traditional homology-based approaches often struggle to capture the complexity and diversity of enzyme-cofactor interactions across different species. This dissertation presents a machine learning (ML) and deep learning (DL) workflow, employing Convolutional Neural Networks (CNNs), to predict enzyme cofactor binding from protein sequence embeddings such as ESM2, SeqVec, and FlashProt. The dataset, consisting of 73,307 protein sequences annotated with 13 cofactors, was obtained from key databases including UniProt, BRENDA, Rhea, and ChEBI. The CNN models, particularly those using ESM2 embeddings, outperformed the others in multi-label classification tasks, achieving high accuracy, low Hamming loss, and high F1 scores. These models excelled at predicting cofactor binding for well-characterized organisms such as Escherichia coli, for which extensive data is available. However, predictions for less-studied organisms, such as Helicobacter pylori and Synechocystis sp. PCC 6803, showed reduced accuracy, underscoring the need for more comprehensive species-specific data. In addition to the prediction workflow, a user-friendly web service was developed, allowing researchers to upload protein sequences and GSM models to receive cofactor predictions and update model reconstructions. This workflow provides a valuable tool for advancing metabolic research, enabling more accurate cofactor predictions and improving GSM model reconstructions. Future work may focus on expanding the dataset, integrating additional data types such as protein structures and enzyme activity, and further exploring advanced DL techniques such as transformers to enhance performance and generalizability across species. |
|---|---|
| Main authors: | Gonçalves, Joana Oliveira |
| Subject: | Machine learning; Deep learning; Multi-label classification; Convolutional Neural Networks; Cofactors |
| Year: | 2024 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade do Minho |
| Language: | Portuguese |
| Source: | RepositóriUM - Universidade do Minho |