**C-Planning: An Automatic Curriculum for Learning Goal-Reaching Tasks**

Tianjun Zhang,   Benjamin Eysenbach,   Ruslan Salakhutdinov,   Sergey Levine,   Joseph E. Gonzalez

Paper,   Code

![**C-Planning** is an algorithm for goal-conditioned RL that uses an automatic curriculum of waypoints to learn policies that can solve complex tasks, such as manipulating multiple objects in sequence to achieve a goal, without requiring any reward functions, manual distance functions, or human demonstrations.](snapshots.png) ![**C-Planning** samples waypoints that are reachable from the initial state, and from which the agent can reach the goal. This maximizes a lower bound on the goal-reaching objective.](way_sampling_v2.png width=50%)

*__Abstract__*: Goal-conditioned reinforcement learning (RL) has recently shown great success at solving a wide range of tasks (e.g., navigation, robotic manipulation). However, learning to reach distant goals remains a central challenge for the field, and the task is particularly hard without offline data, expert demonstrations, or reward shaping. In this paper, we propose to solve the distant goal-reaching task by using search at training time to generate a curriculum of intermediate states. Specifically, we introduce the algorithm Classifier-Planning (C-Planning) by framing the learning of goal-conditioned policies as variational inference. C-Planning naturally follows expectation maximization (EM): the E-step corresponds to planning an optimal sequence of waypoints using graph search, while the M-step aims to learn a goal-conditioned policy to reach those waypoints. One essential difficulty in designing such an algorithm is accurately modeling the distribution over waypoints to sample from. In C-Planning, we sample the waypoints using a value function learned with contrastive methods. Unlike prior methods that combine goal-conditioned RL with graph search, ours performs search only during training and not at test time, significantly decreasing the compute cost of deploying the learned policy. Empirically, we demonstrate that our method not only improves the sample efficiency of prior methods, but also successfully solves temporally extended navigation and manipulation tasks that prior goal-conditioned RL methods (including those based on graph search) fail to solve.

--------------------------
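The abstract describes the E-step as planning a sequence of waypoints via graph search over previously visited states, scored by a value function learned with contrastive methods. The snippet below is a minimal sketch of what such a waypoint search might look like; it is not the authors' implementation. The function name `plan_waypoints`, the learned reachability score `reach_prob(s, g)`, the negative-log edge cost, and the use of `networkx` shortest paths are all illustrative assumptions.

~~~python
import numpy as np
import networkx as nx

def plan_waypoints(reach_prob, states, start_idx, goal_idx, eps=1e-6):
    """Sketch of an E-step waypoint search over replay-buffer states.

    `reach_prob(s, g)` is a hypothetical learned reachability score in (0, 1]
    (e.g., from a contrastive classifier). Edge cost is its negative log, so
    the shortest path maximizes the product of per-edge reachabilities.
    """
    graph = nx.DiGraph()
    n = len(states)
    for i in range(n):
        for j in range(n):
            if i != j:
                p = max(reach_prob(states[i], states[j]), eps)
                graph.add_edge(i, j, weight=-np.log(p))
    path = nx.shortest_path(graph, start_idx, goal_idx, weight="weight")
    return [states[k] for k in path[1:-1]]  # intermediate waypoints only

# Toy usage with a distance-based stand-in for the learned classifier.
states = [np.array([x, 0.0]) for x in np.linspace(0.0, 10.0, 11)]
reach = lambda s, g: np.exp(-np.linalg.norm(s - g))
waypoints = plan_waypoints(reach, states, start_idx=0, goal_idx=10)
~~~

In the M-step, the goal-conditioned policy would then be trained to reach each returned waypoint in turn before the final goal; since the search runs only at training time, the deployed policy is conditioned directly on the goal.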