Song, Yunlong (2023)
Minimax and entropic proximal policy optimization.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00024754
Master Thesis, Primary publication, Publisher's Version
Text: yunlong_thesis.pdf (3 MB)
Copyright Information: CC BY 4.0 International - Creative Commons, Attribution.
| Item Type: | Master Thesis |
|---|---|
| Type of entry: | Primary publication |
| Title: | Minimax and entropic proximal policy optimization |
| Language: | English |
| Referees: | Peters, Prof. Dr. Jan; Koeppl, Prof. Dr. Heinz; Belousov, Boris |
| Date: | 26 October 2023 |
| Place of Publication: | Darmstadt |
| Collation: | vi, 42 pages |
| DOI: | 10.26083/tuprints-00024754 |
| Abstract: | First-order gradient descent is currently the most commonly used optimization method for training deep neural networks, especially networks with shared parameters or recurrent neural networks (RNNs). Policy gradient methods provide several advantages over other reinforcement learning algorithms; for example, they can naturally handle continuous state and action spaces. In this thesis, we contribute two policy gradient algorithms that are straightforward to implement and effective for solving challenging environments; both methods are compatible with large nonlinear function approximators and are optimized using stochastic gradient descent. First, we propose a new family of policy gradient algorithms, which we call minimax entropic policy optimization (MMPO). The new method combines trust region policy optimization (TRPO) with the idea of minimax training: stable policy improvement is achieved by formulating the KL-divergence constraint of TRPO as a loss function via a ramp-function transformation and then carrying out a minimax optimization between two stochastic gradient optimizers, one optimizing the "surrogate" objective and the other maximizing the ramp-transformed KL-divergence loss. Our experiments on several challenging continuous control tasks demonstrate that MMPO achieves performance comparable to TRPO and proximal policy optimization (PPO), while being much easier to implement than TRPO and guaranteeing that the KL-divergence bound is satisfied. Second, we investigate the use of the f-divergence as a regularizer for policy improvement; the f-divergence is a general class of functionals measuring the divergence between two probability distributions, with the KL-divergence as a special case. The f-divergence can either be imposed as a hard constraint or added to the objective as a soft constraint. We propose to treat it as a soft constraint by penalizing the policy update with a penalty term on the f-divergence between successive policy distributions. We term this unconstrained policy optimization method f-divergence penalized policy optimization (f-PPO). We focus on the one-parameter family of α-divergences, a special case of f-divergences, and study the influence of the choice of divergence function on policy optimization. Empirical results on a series of MuJoCo environments show that f-PPO with a proper choice of α-divergence is effective for solving challenging continuous control tasks; different α-divergences act differently on the policy entropy and hence on the policy improvement. |
| Status: | Publisher's Version |
| URN: | urn:nbn:de:tuda-tuprints-247547 |
| Classification DDC: | 000 Generalities, computers, information > 004 Computer science |
| Divisions: | 20 Department of Computer Science > Intelligent Autonomous Systems |
| TU-Projects: | EC/H2020\|640554\|SKILLS4ROBOTS |
| Date Deposited: | 26 Oct 2023 13:43 |
| Last Modified: | 23 Jan 2024 09:20 |
| URI: | https://tuprints.ulb.tu-darmstadt.de/id/eprint/24754 |
| PPN: | 512797285 |
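
The MMPO construction summarized in the abstract above (recasting the TRPO KL constraint as a ramp-transformed loss and playing two stochastic gradient optimizers against each other) can be written down schematically. The LaTeX sketch below is one plausible reading under assumed notation: $L^{\text{surr}}$ for the surrogate objective, $\delta$ for the trust-region radius, $\eta$ for a multiplier, and $\phi$ for the ramp; none of these symbols are quoted from the thesis itself.

```latex
% One plausible reading of the MMPO minimax formulation (notation assumed, not quoted from
% the thesis): the TRPO KL constraint is recast as a ramp-transformed loss, and two
% stochastic gradient optimizers play a minimax game over the resulting objective --
% one ascends the surrogate term in \theta, the other ascends the ramp-transformed
% KL loss by driving the multiplier \eta upward whenever the bound is violated.
\[
  \max_{\theta}\;\min_{\eta \ge 0}\;
  L^{\text{surr}}(\theta)
  \;-\; \eta\,\phi\!\Big(\mathbb{E}_{s}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot\mid s)\,\big\|\,\pi_{\theta}(\cdot\mid s)\big)\big] - \delta\Big),
  \qquad \phi(x)=\max(0,x).
\]
```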
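
For the second contribution, f-PPO, the abstract describes penalizing the policy update with an α-divergence between successive policy distributions. The Python sketch below illustrates that idea under assumptions: the function names (`alpha_divergence`, `f_ppo_loss`), the particular α-divergence parameterization and direction, and the penalty weight `beta` are choices made here for illustration, not details taken from the thesis.

```python
import torch

def alpha_divergence(logp_old: torch.Tensor, logp_new: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Monte-Carlo estimate of an alpha-divergence D_alpha(pi_old || pi_new), computed from
    log-probabilities of actions sampled under pi_old. As alpha -> 1 this recovers
    KL(pi_old || pi_new). (Parameterization assumed here for illustration.)"""
    ratio = torch.exp(logp_new - logp_old)        # pi_new(a|s) / pi_old(a|s) on old samples
    if abs(alpha - 1.0) < 1e-6:                   # KL limit of the alpha family
        return (logp_old - logp_new).mean()
    return (1.0 - ratio.pow(1.0 - alpha)).mean() / (alpha * (1.0 - alpha))

def f_ppo_loss(logp_old: torch.Tensor, logp_new: torch.Tensor, advantages: torch.Tensor,
               beta: float = 1.0, alpha: float = 2.0) -> torch.Tensor:
    """Penalized ('soft constraint') surrogate: maximize E[ratio * advantage] minus a
    beta-weighted alpha-divergence penalty between successive policies; the negative
    is returned so it can be minimized with a standard optimizer."""
    ratio = torch.exp(logp_new - logp_old)
    surrogate = (ratio * advantages).mean()
    penalty = alpha_divergence(logp_old, logp_new, alpha)
    return -(surrogate - beta * penalty)
```

In a training loop, `logp_old` would be stored when trajectories are collected, `logp_new` would come from the current policy, and the returned loss would be minimized with a stochastic gradient optimizer, matching the unconstrained penalized update described in the abstract.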