Publication
Q-Learning applied to games: a reward focused study
| Abstract: | Q-Learning is one of the most popular reinforcement learning algorithms. It can solve a variety of complex tasks that require sequential decision-making, all while using the same algorithm and without the developer encoding task-specific strategies. This is achieved by processing a reward received after each decision is made. To evaluate the performance of Q-Learning on different problems, video games are a valuable testbed, as each game has its own mechanics and an objective that must be learned. Furthermore, results from testing different algorithms under the same conditions can be easily compared. This thesis presents a study of Q-Learning, covering its origins and how it operates, showcasing various state-of-the-art techniques used to improve the algorithm, and detailing the procedures that have become standard when training Q-Learning agents to play Atari 2600 video games. Our implementation of the algorithm, following the same techniques and procedures, is run on several video games, and its training performance is compared with that reported in articles that trained on the same games and attained state-of-the-art performance. Additionally, we explored crafting new reward schemes by modifying the games' default rewards. Various custom rewards were created and combined to evaluate how they affect performance. In these tests, we found that rewards that signal both good and bad behaviour led to better performance than rewards that only signal good behaviour, which is what some games do by default. We also found that more game-specific rewards could attain better results, but these required a more careful analysis of each game and were not easily transferable to other games. As a more general approach, we tested reward changes designed to incentivize exploration in games that are harder to navigate, and thus harder to learn from. We found that these changes not only improved exploration but also, after some parameter tuning, improved performance. These algorithms are designed to teach the agent to accumulate rewards, but how does this relate to the game score? To address this question, we present preliminary experiments showing the relationship between the evolution of reward accumulation and game score. |
|---|---|
| Main authors: | Ferreira, Pedro Henrique de Passos |
| Subject: | Q-Learning; Policy; Markov processes; Neural networks; Markov chains |
| Year: | 2023 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade do Minho |
| Language: | English |
| Source: | RepositóriUM - Universidade do Minho |
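
The abstract describes Q-Learning as learning from a reward received after each decision. It also reports that rewards signalling both good and bad behaviour outperformed rewards that only signal good behaviour. The sketch below illustrates both ideas under stated assumptions. It uses tabular Q-Learning on a hypothetical seven-cell corridor, not the deep Q-Learning agents and Atari 2600 games studied in the thesis. The environment (`step`), the failure penalty in `shaped_reward` and the hyperparameters (`alpha`, `gamma`, `epsilon`, number of episodes) are illustrative choices, not the author's.

```python
import random

# Hypothetical toy task (not an Atari game): a corridor of N_CELLS cells.
# Cell 0 is a failure state, cell N_CELLS - 1 is the goal; the agent starts in the middle.
N_CELLS = 7
START = N_CELLS // 2
ACTIONS = (-1, +1)  # index 0 = move left, index 1 = move right


def step(state, action):
    """Advance the corridor and return (next_state, game_reward, done).

    Like some games' default scoring, the game reward only signals good
    behaviour: +1 at the goal, nothing when the agent fails.
    """
    next_state = state + action
    if next_state == N_CELLS - 1:
        return next_state, 1.0, True
    if next_state == 0:
        return next_state, 0.0, True
    return next_state, 0.0, False


def shaped_reward(next_state, game_reward, done):
    """Hypothetical custom reward: keep the game reward, but also penalise
    failure so that bad behaviour is signalled too."""
    if done and next_state == 0:
        return -1.0
    return game_reward


def q_update(q_sa, reward, next_max, done, alpha, gamma):
    """One Q-Learning update: Q <- Q + alpha * (target - Q).
    Terminal transitions do not bootstrap from the next state."""
    target = reward if done else reward + gamma * next_max
    return q_sa + alpha * (target - q_sa)


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, shaping=True, seed=0):
    """Tabular epsilon-greedy Q-Learning on the corridor; returns the Q-table."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_CELLS)]
    for _ in range(episodes):
        state, done = START, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))  # explore: random action
            else:
                best = max(q[state])  # exploit: greedy action, ties broken at random
                a = rng.choice([i for i, v in enumerate(q[state]) if v == best])
            next_state, game_r, done = step(state, ACTIONS[a])
            r = shaped_reward(next_state, game_r, done) if shaping else game_r
            q[state][a] = q_update(q[state][a], r, max(q[next_state]), done, alpha, gamma)
            state = next_state
    return q


def greedy_policy(q):
    """Greedy action ("L" or "R") for each non-terminal cell; ties read as "R"."""
    return ["L" if row[0] > row[1] else "R" for row in q[1:N_CELLS - 1]]


if __name__ == "__main__":
    for shaping in (False, True):
        q = train(shaping=shaping)
        print("shaped " if shaping else "default", greedy_policy(q))
```

Running the script prints the greedy policy learned with and without the penalty. With the default reward, falling off the left end leaves that action's Q-value at zero, so the agent receives no direct signal that the outcome is bad. The shaped run drives that value negative. This is only a toy illustration of the reward-design question, not a reproduction of the thesis results.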