Constrained Expectation-Maximization Methods for Effective Reinforcement Learning


Recent advances in reinforcement learning (RL) show that parametric action-selection policies can often be learned effectively by directly optimizing a performance lower bound subject to pre-defined constraints on policy behavior. Motivated by this insight, this paper develops new policy search techniques in which RL is achieved by maximizing a performance lower bound originally derived from an Expectation-Maximization (EM) method. For reliable RL, our techniques must simultaneously guarantee bounded changes in policy behavior, measured through KL divergence. We pursue two separate approaches to this constrained policy optimization problem, resulting in two new RL algorithms. The first combines a conjugate gradient technique with a Bayesian learning method for approximate optimization. The second minimizes a loss function derived from the Lagrangian of the constrained policy search problem. Both algorithms have been evaluated on several benchmark problems from OpenAI Gym. The experimental results clearly demonstrate that our algorithms are highly effective in comparison to several well-known RL algorithms.
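To make the second approach concrete, the sketch below illustrates one plausible form of a Lagrangian-based loss for KL-constrained policy search: a weighted negative log-likelihood (EM-style M-step objective) plus a KL penalty keeping the new policy close to the old one. The tabular softmax parameterization, the fixed multiplier `lam`, and all function names are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lagrangian_loss(theta, states, actions, weights, old_probs, lam):
    """Illustrative loss: -weighted log-likelihood + lam * KL(old || new).

    theta     : (n_states, n_actions) logits of a tabular softmax policy
                (hypothetical parameterization, assumed for this sketch).
    states    : sampled state indices.
    actions   : sampled action indices.
    weights   : EM-style per-sample weights (e.g. derived from returns).
    old_probs : (n_states, n_actions) action probabilities of the old policy.
    lam       : Lagrange multiplier for the KL constraint (held fixed here;
                in practice it would be adapted to enforce the constraint).
    """
    probs = softmax(theta)
    log_probs = np.log(probs + 1e-12)
    # Weighted negative log-likelihood of the sampled actions.
    nll = -np.mean(weights * log_probs[states, actions])
    # Mean KL divergence from the old policy to the new one, per state.
    kl = np.mean(
        np.sum(old_probs * (np.log(old_probs + 1e-12) - log_probs), axis=-1)
    )
    return nll + lam * kl
```

Minimizing such a loss with any gradient-based optimizer would trade off improving the weighted likelihood against staying close (in KL) to the previous policy, which is the general shape of the constrained update the abstract describes.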