During purely curious exploration, the JACO arm discovers how to pick up cubes, moves them around the workspace and even explores whether they can be balanced on their edges.
Curious exploration enables OP3 to walk upright, balance on one foot, sit down and even catch itself safely when leaping backwards, all without a specific target task to optimise for.
Intrinsic motivation [1, 2] can be a powerful concept to endow an agent with a mechanism to continuously explore its environment in the absence of task information. One common way to implement intrinsic motivation is via curiosity learning [3, 4]. With this method, a predictive model of the environment’s response to an agent’s actions is trained alongside the agent’s policy. This model is also called a world model. When an action is taken, the world model makes a prediction about the agent’s next observation. This prediction is then compared to the true observation made by the agent. Crucially, the reward given to the agent for taking this action is scaled by the error the model made when predicting the next observation. This way, the agent is rewarded for taking actions whose outcomes are not yet well predictable. Simultaneously, the world model is updated to better predict the outcome of said action.
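The loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the SelMo implementation: the linear world model, the squared-error reward, the learning rate and all names (`LinearWorldModel`, `curiosity_reward`, `demo_rewards`) are assumptions made for this example only.

```python
class LinearWorldModel:
    """Toy world model (assumption): next observation = W * [obs, action]."""

    def __init__(self, obs_dim, act_dim, lr=0.1):
        self.lr = lr
        # One weight row per predicted observation dimension.
        self.weights = [[0.0] * (obs_dim + act_dim) for _ in range(obs_dim)]

    def predict(self, obs, action):
        x = list(obs) + list(action)
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]

    def update(self, obs, action, next_obs):
        """One gradient step on the squared prediction error."""
        x = list(obs) + list(action)
        pred = self.predict(obs, action)
        for row, p, target in zip(self.weights, pred, next_obs):
            err = p - target
            for j, xj in enumerate(x):
                row[j] -= self.lr * err * xj


def curiosity_reward(model, obs, action, next_obs, scale=1.0):
    """Reward scaled by the world model's prediction error (squared error here)."""
    pred = model.predict(obs, action)
    return scale * sum((p - t) ** 2 for p, t in zip(pred, next_obs))


def demo_rewards(steps=20):
    """Repeat one transition: the reward shrinks as the model learns it."""
    model = LinearWorldModel(obs_dim=2, act_dim=1)
    obs, action, next_obs = [1.0, 0.5], [0.2], [0.8, 0.3]
    rewards = []
    for _ in range(steps):
        rewards.append(curiosity_reward(model, obs, action, next_obs))
        model.update(obs, action, next_obs)  # model improves on this outcome
    return rewards


if __name__ == "__main__":
    for step, reward in enumerate(demo_rewards(5)):
        print(step, round(reward, 4))
```

Repeating the same transition makes it progressively less rewarding, which is what pushes a curious agent towards outcomes it cannot yet predict.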
This mechanism has been applied successfully in on-policy settings, e.g. to beat 2D computer games in an unsupervised way or to train a general policy which is easily adaptable to concrete downstream tasks. However, we believe that the true strength of curiosity learning lies in the diverse behaviour which emerges during the curious exploration process: as the curiosity objective changes, so does the resulting behaviour of the agent, thereby discovering many complex policies which could be utilised later on, if they were retained and not overwritten.
In this paper, we make two contributions to study curiosity learning and harness its emergent behaviour. First, we introduce SelMo, an off-policy realisation of a self-motivated, curiosity-based method for exploration. We show that using SelMo, meaningful and diverse behaviour emerges solely based on the optimisation of the curiosity objective in simulated manipulation and locomotion domains. Second, we propose to extend the focus in the application of curiosity learning towards the identification and retention of emerging intermediate behaviours. We support this conjecture with an experiment which reloads self-discovered behaviours as pretrained, auxiliary skills in a hierarchical reinforcement learning setup.
We run SelMo in two simulated continuous control robotic domains: on a 6-DoF JACO arm with a three-fingered gripper and on a 20-DoF humanoid robot, the OP3. The two platforms present challenging learning environments for object manipulation and locomotion, respectively. While only optimising for curiosity, we observe that complex human-interpretable behaviour emerges over the course of the training runs. For instance, JACO learns to pick up and move cubes without any supervision, and the OP3 learns to balance on a single foot or sit down safely without falling over.
However, the impressive behaviours observed during curious exploration have one crucial drawback: they are not persistent, as they keep changing with the curiosity reward function. As the agent keeps repeating a certain behaviour, e.g. JACO lifting the red cube, the curiosity rewards accumulated by this policy diminish. Consequently, the agent learns a modified policy which acquires higher curiosity rewards again, e.g. moving the cube outside the workspace or even attending to the other cube. But this new behaviour overwrites the old one. Nevertheless, we believe that retaining the emergent behaviours from curious exploration equips the agent with a valuable skill set to learn new tasks more quickly. In order to investigate this conjecture, we set up an experiment to probe the utility of the self-discovered skills.
We treat randomly sampled snapshots from different phases of the curious exploration as auxiliary skills in a modular learning framework and measure how quickly a new target skill can be learned by using these auxiliaries. In the case of the JACO arm, we set the target task to be “lift the red cube” and use five randomly sampled self-discovered behaviours as auxiliaries. We compare the learning of this downstream task to an SAC-X baseline which uses a curriculum of reward functions to reward reaching and moving the red cube, which ultimately facilitates learning to lift it as well. We find that even this simple setup for skill reuse already speeds up the learning progress of the downstream task commensurate with a hand-designed reward curriculum. The results suggest that the automatic identification and retention of useful emerging behaviour from curious exploration is a fruitful avenue of future investigation in unsupervised reinforcement learning.