2019
Secondary publication
Article
Publisher's version

Entropic Regularization of Markov Decision Processes

File(s)
Belousov.entropy.pdf (main publication)
License: CC BY 4.0 International
Format: Adobe PDF
Size: 650.6 KB
TUDa URI
tuda/4766
URN
urn:nbn:de:tuda-tuprints-92409
Authors
Belousov, Boris
Peters, Jan
Abstract

An optimal feedback controller for a given Markov decision process (MDP) can in principle be synthesized by value or policy iteration. However, if the system dynamics and the reward function are unknown, a learning agent must discover an optimal controller via direct interaction with the environment. Such interactive data gathering commonly leads to divergence towards dangerous or uninformative regions of the state space unless additional regularization measures are taken. Prior works proposed bounding the information loss, measured by the Kullback–Leibler (KL) divergence, at every policy improvement step to eliminate instability in the learning dynamics. In this paper, we consider a broader family of f-divergences, and more concretely α-divergences, which inherit the beneficial property of providing the policy improvement step in closed form while at the same time yielding a corresponding dual objective for policy evaluation. This entropic proximal policy optimization view gives a unified perspective on compatible actor-critic architectures. In particular, common least-squares value function estimation coupled with advantage-weighted maximum likelihood policy improvement is shown to correspond to the Pearson χ²-divergence penalty. Other actor-critic pairs arise for various choices of the penalty-generating function f. On a concrete instantiation of our framework with the α-divergence, we carry out an asymptotic analysis of the solutions for different values of α and demonstrate the effects of the divergence function choice on common standard reinforcement learning problems.
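The closed-form policy improvement step mentioned in the abstract can be illustrated for the KL-divergence case: bounding the KL divergence to the old policy yields an update proportional to the old policy reweighted by exponentiated advantages. The following is a minimal sketch for a discrete action distribution, not the paper's implementation; the function name, the fixed temperature `eta`, and the example numbers are illustrative assumptions.

```python
import numpy as np

def kl_policy_improvement(pi_old, advantages, eta):
    """Closed-form KL-regularized policy improvement (illustrative sketch).

    For a KL penalty with temperature eta, the improved discrete policy is
        pi_new(a) ∝ pi_old(a) * exp(A(a) / eta),
    i.e. the old policy reweighted by exponentiated advantages and renormalized.
    """
    weights = pi_old * np.exp(advantages / eta)
    return weights / weights.sum()

# Hypothetical example: a uniform policy over four actions with given advantages.
pi_old = np.array([0.25, 0.25, 0.25, 0.25])
advantages = np.array([1.0, 0.0, -1.0, 0.5])

pi_new = kl_policy_improvement(pi_old, advantages, eta=1.0)
```

A smaller `eta` (a tighter entropic penalty is relaxed) concentrates `pi_new` more sharply on the highest-advantage action, while a large `eta` keeps it close to `pi_old`; other choices of the penalty-generating function f lead to different reweighting schemes, as discussed in the paper.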

Language
English
Department/field
20 Department of Computer Science > Intelligent Autonomous Systems
DDC
000 Computer science, information and general works > 004 Computer science
Institution
Universitäts- und Landesbibliothek Darmstadt
Place
Darmstadt
Journal / series title
Entropy
Volume
21
Issue
7
ISSN
1099-4300
Publisher
MDPI
Year of first publication
2019
Publisher DOI
10.3390/e21070674
PPN
45482193X