Actor-Expert: A Framework for using Q-learning in Continuous Action Spaces
2019 · Open Access
DOI: https://doi.org/10.7939/r3-qgdp-3872
OA: W2941168284
Q-learning can be difficult to use in continuous action spaces, because finding the maximal action requires solving a difficult optimization at every step. Common strategies have been to discretize the action space, solve the maximization with a powerful optimizer at each step, restrict the functional form of the action-values, or optimize a different entropy-regularized objective to learn a policy proportional to action-values. Such methods, however, can prevent learning accurate action-values, be expensive to execute at each step, or find a potentially suboptimal policy. In this thesis, we propose a new policy search objective that facilitates using Q-learning, and a new framework, Actor-Expert, that optimizes this objective. The Expert uses approximate Q-learning to update the action-values towards optimal action-values. The Actor iteratively learns the maximal actions over time for these changing action-values. We develop a Conditional Cross Entropy Method (CCEM) for the Actor; this global optimization approach facilitates the use of generically parameterized action-values (Expert) with a separate policy (Actor). The method iteratively concentrates density around maximal actions, conditioned on state. We demonstrate in a toy environment that Actor-Expert with an unrestricted action-value parameterization and an efficient exploration mechanism succeeds where previous Q-learning methods fail. We also demonstrate that Actor-Expert performs as well as or better than previous Q-learning methods on benchmark continuous-action environments. We also show that it is comparable to Actor-Critic baselines, suggesting a new distinction among methods that learn both a value function and a policy: learning the action-values of the current policy, or learning (optimal) action-values decoupled from the policy.
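To make the division of labor between Actor and Expert concrete, the sketch below shows one illustrative Actor-Expert update in PyTorch. It is a minimal reading of the framework as summarized above, not the thesis's implementation: the Gaussian actor, the QNetwork interface, and hyperparameters such as n_samples and elite_frac are assumptions, and the max over next actions is approximated here by sampling candidate actions from the Actor.

```python
import torch
import torch.nn as nn


class GaussianActor(nn.Module):
    """Conditional Gaussian policy: mean and log-std of actions, given a state."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def distribution(self, state):
        h = self.body(state)
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(self.mean(h), std)


class QNetwork(nn.Module):
    """Generic action-value network Q(s, a) over a concatenated state-action input."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def actor_expert_update(batch, actor, q_net, actor_opt, q_opt,
                        gamma=0.99, n_samples=30, elite_frac=0.2):
    """One illustrative Actor-Expert step on a batch (s, a, r, s', done)."""
    states, actions, rewards, next_states, dones = batch

    # Expert: approximate Q-learning. The max over next actions is approximated
    # by sampling candidate actions from the Actor and taking the best Q-value.
    with torch.no_grad():
        cand = actor.distribution(next_states).sample((n_samples,))      # [N, B, A]
        ns = next_states.unsqueeze(0).expand(n_samples, -1, -1)
        q_next = q_net(ns.reshape(-1, ns.shape[-1]),
                       cand.reshape(-1, cand.shape[-1])).reshape(n_samples, -1)
        target = rewards + gamma * (1.0 - dones) * q_next.max(dim=0).values

    q_loss = ((q_net(states, actions).squeeze(-1) - target) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor: Conditional Cross Entropy Method. Sample actions per state, keep the
    # top quantile under the Expert's Q-values, and raise their log-likelihood,
    # concentrating the conditional density around maximal actions over time.
    dist = actor.distribution(states)
    samples = dist.sample((n_samples,))                                  # [N, B, A]
    with torch.no_grad():
        s_rep = states.unsqueeze(0).expand(n_samples, -1, -1)
        q_vals = q_net(s_rep.reshape(-1, s_rep.shape[-1]),
                       samples.reshape(-1, samples.shape[-1])).reshape(n_samples, -1)
    n_elite = max(1, int(elite_frac * n_samples))
    elite_idx = q_vals.topk(n_elite, dim=0).indices                      # [E, B]
    elite = samples.gather(
        0, elite_idx.unsqueeze(-1).expand(-1, -1, samples.shape[-1]))    # [E, B, A]
    actor_loss = -dist.log_prob(elite).sum(-1).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return q_loss.item(), actor_loss.item()
```

Because the elite set is recomputed per state, the cross-entropy update is conditional: each state's action density is pushed toward the actions the Expert currently scores highest, which is how the Actor can track the Expert's changing action-values without restricting their parameterization.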