By Arthur Juliani, Senior Software Engineer, Machine Learning, Unity Technologies
Now there is an easy way to encourage agents to explore the environment more effectively when the rewards are infrequent and sparsely distributed. These agents can do this using a reward they give themselves based on how surprised they are about the outcome of their actions. In this post, I will explain how this new system works, and then show how we can use it to help our agent solve a task that would otherwise be much more difficult for a vanilla Reinforcement Learning (RL) algorithm to solve.
When it comes to Reinforcement Learning, the primary learning signal comes in the form of the reward: a scalar value provided to the agent after every decision it makes. This reward is typically provided by the environment itself and specified by the creator of the environment. These rewards often correspond to things like +1.0 for reaching the goal, -1.0 for dying, etc. We can think of this kind of rewards as being extrinsic because they come from outside the agent. If there are extrinsic rewards, then that means there must be intrinsic ones too. Rather than being provided by the environment, intrinsic rewards are generated by the agent itself based on some criteria. Of course, not any intrinsic reward would do. We want intrinsic rewards which ultimately serve some purpose, such as changing the agent’s behavior such that it will get even greater extrinsic rewards in the future, or that the agent will explore the world more than it might have otherwise. In humans and other mammals, the pursuit of these intrinsic rewards is often referred to as intrinsic motivation and tied closely to our feelings of agency.
Researchers in the field of Reinforcement Learning have put a lot of thought into developing good systems for providing intrinsic rewards to agents which endow them with similar motivation as we find in nature’s agents. One popular approach is to endow the agent with a sense of curiosity and to reward it based on how surprised it is by the world around it. If you think about how a young baby learns about the world, it isn’t pursuing any specific goal, but rather playing and exploring for the novelty of the experience. You can say that the child is curious. The idea behind curiosity-driven exploration is to instill this kind of motivation into our agents. If the agent is rewarded for reaching states which are surprising to it, then it will learn strategies to explore the environment to find more and more surprising states. Along the way, the agent will hopefully also discover the extrinsic reward as well, such as a distant goal position in a maze, or sparse resource on a landscape.
We chose to implement one specific such approach from a recent paper released last year by Deepak Pathak and his colleagues at Berkeley. It is called Curiosity-driven Exploration by Self-supervised Prediction, and you can read the paper hereif you are interested in the full details. In the paper, the authors formulate the idea of curiosity in a clever and generalizable way. They propose to train two separate neural-networks: a forward and an inverse model. The inverse model is trained to take the current and next observation received by the agent, encode them both using a single encoder, and use the result to predict the action that was taken between the occurrence of the two observations. The forward model is then trained to take the encoded current observation and action and predict the encoded next observation. The difference between the predicted and real encodings is then used as the intrinsic reward, and fed to the agent. Bigger difference means bigger surprise, which in turn means bigger intrinsic reward.
Read the source post on the Unity blog.