Abstract: Many of the challenges facing today’s reinforcement learning (RL) algorithms, such as robustness, generalization, transfer, and computational efficiency, are symptoms of an underlying problem: RL agents use too many bits of information from the observations. Prior work has convincingly argued why minimizing information is useful in the supervised learning setting. The RL setting is unique because (1) its sequential nature allows an agent to use past information to avoid looking at future observations and (2) the agent can optimize its behavior to prefer states where decision making requires few bits. We take advantage of these properties to propose a method for learning simple policies. This method brings together ideas from information bottlenecks, model-based RL, and bits-back coding into a simple and theoretically justified algorithm. Our method jointly optimizes a latent-space model and policy to be self-consistent, such that the policy avoids states where the model is inaccurate. We demonstrate that our method achieves higher reward per bit than prior methods, and learns compressed policies that are (provably and empirically) more robust and generalize better to new tasks.
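To make the "reward per bit" trade-off concrete, here is a minimal sketch (not the paper's implementation) of the kind of bitrate penalty such an objective might use. It assumes a diagonal-Gaussian encoder over a latent state and a unit-Gaussian prior; the coefficient `lam` and all function names are illustrative.

```python
import numpy as np

def gaussian_kl(mu, log_std, prior_mu=0.0, prior_log_std=0.0):
    """Per-dimension KL( N(mu, std^2) || N(prior_mu, prior_std^2) ), in nats.

    Under a bits-back / information-bottleneck view, this KL measures how
    many bits the encoder extracts from the observation into the latent.
    """
    var = np.exp(2.0 * log_std)
    prior_var = np.exp(2.0 * prior_log_std)
    return 0.5 * ((var + (mu - prior_mu) ** 2) / prior_var
                  - 1.0 + 2.0 * (prior_log_std - log_std))

def bitrate_penalized_return(reward, mu, log_std, lam=0.1):
    """Task reward minus lam * (bits used to encode the latent).

    Converts the KL from nats to bits; a policy maximizing this quantity
    is pushed toward states where decisions require few observation bits.
    """
    bits = gaussian_kl(mu, log_std).sum() / np.log(2.0)
    return reward - lam * bits, bits
```

For example, an encoder that matches the prior exactly (`mu = 0`, `log_std = 0`) pays zero bits, so the penalized return equals the raw reward; shifting the posterior mean away from the prior incurs a positive bit cost.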
Videos of Learned Policies
Driving with Traffic: The agent controls the green car and is rewarded for moving to the right. Agents trained with tighter bitrate constraints (bottom) are more conservative and pass fewer cars.
Active Cruise Control: The agent controls the green car and is rewarded for moving to the right; this agent cannot pass other cars. In the videos below, the bottom frames visualize when the agent is observing many bits of input; darker frames indicate that the agent is ignoring most of the observation. Observe that the policy trained with a tighter bitrate constraint (0.1 bits, right) observes fewer bits than the policy trained with a looser constraint (1 bit, left). Both policies use more bits when tailgating.