A Sandpile Model for Reliable Actor-Critic Reinforcement Learning


Actor-critic algorithms have been increasingly researched for tackling challenging reinforcement learning problems. These algorithms are usually composed of two distinct learning processes, namely actor (a.k.a. policy) learning and critic (a.k.a. value function) learning. Actor learning depends heavily on critic learning; in particular, unreliable critic learning caused by divergence can significantly degrade the effectiveness of actor-critic algorithms. To address this issue, many successful algorithms have recently been developed with the aim of improving the accuracy of value function approximation. However, these algorithms introduce extra complexity into the learning process and may actually make effective learning more difficult. In this research, we therefore consider a simpler approach to improving critic learning reliability. This approach seamlessly integrates an adapted Sandpile Model with the critic learning process so as to achieve a desirable self-organizing property for reliable critic learning. Following this approach, we propose a new actor-critic learning algorithm and evaluate its effectiveness and learning reliability experimentally. As the experimental results strongly demonstrate, our new algorithm can perform much better than traditional actor-critic algorithms. Moreover, correlation analysis suggests that a strong correlation exists between learning reliability and effectiveness. This finding may be important for the future development of powerful reinforcement learning algorithms.
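To make the actor/critic interplay described above concrete, the following is a minimal sketch of the baseline tabular one-step actor-critic scheme that the abstract takes as its starting point: the critic learns state values by temporal-difference updates, and the actor adjusts softmax policy logits using the critic's TD error as an advantage signal. This is not the paper's Sandpile-augmented algorithm (whose details are not given here); the toy chain environment, variable names, and learning rates are illustrative assumptions.

```python
import numpy as np

# Minimal tabular one-step actor-critic on a toy 2-state chain MDP.
# Critic: V(s) learned via TD(0). Actor: softmax policy over logits,
# updated by a policy-gradient step weighted by the TD error.
# Environment and hyperparameters are illustrative only.

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, alpha_v, alpha_pi = 0.9, 0.1, 0.05

V = np.zeros(n_states)                   # critic: state-value estimates
theta = np.zeros((n_states, n_actions))  # actor: policy logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def env_step(s, a):
    # Toy dynamics: action 0 keeps the state, action 1 flips it;
    # reward 1 is given only for taking action 1 in state 0.
    s_next = s if a == 0 else 1 - s
    r = 1.0 if (s == 0 and a == 1) else 0.0
    return s_next, r

s = 0
for _ in range(5000):
    pi = softmax(theta[s])
    a = rng.choice(n_actions, p=pi)
    s_next, r = env_step(s, a)

    # Critic update: one-step TD error moves V toward the bootstrapped target.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * td_error

    # Actor update: gradient of log pi(a|s) for a softmax policy,
    # scaled by the TD error serving as an advantage estimate.
    grad_log = -pi
    grad_log[a] += 1.0
    theta[s] += alpha_pi * td_error * grad_log

    s = s_next
```

Because the actor's update is driven entirely by `td_error`, any divergence or noise in the critic's estimates `V` feeds directly into the policy, which is the reliability problem the abstract targets.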