Schultheis, Matthias ; Rothkopf, Constantin A. ; Koeppl, Heinz
eds.: Koyejo, S. ; Mohamed, S. ; Agarwal, A. ; Belgrave, D. ; Cho, K. ; Oh, A. (2025)
Reinforcement Learning with Non-Exponential Discounting.
The Thirty-Sixth Annual Conference on Neural Information Processing Systems. New Orleans ; Virtual Conference (28.11.2022 - 09.12.2022)
doi: 10.26083/tuprints-00028934
Conference or Workshop Item, Secondary publication, Publisher's Version
Text: NeurIPS-2022-reinforcement-learning-with-non-exponential-discounting-Paper-Conference.pdf (2MB). Copyright Information: CC BY 4.0 International - Creative Commons, Attribution.
Text (Supplement): appendix.pdf (909kB). Copyright Information: CC BY 4.0 International - Creative Commons, Attribution.
Item Type: | Conference or Workshop Item |
---|---|
Type of entry: | Secondary publication |
Title: | Reinforcement Learning with Non-Exponential Discounting |
Language: | English |
Date: | 15 January 2025 |
Place of Publication: | Darmstadt |
Year of primary publication: | 2022 |
Place of primary publication: | San Diego, CA |
Publisher: | NeurIPS |
Book Title: | Advances in Neural Information Processing Systems 35 (NeurIPS 2022) |
Collation: | 14 pages |
Event Title: | The Thirty-Sixth Annual Conference on Neural Information Processing Systems |
Event Location: | New Orleans ; Virtual Conference |
Event Dates: | 28.11.2022 - 09.12.2022 |
DOI: | 10.26083/tuprints-00028934 |
Corresponding Links: | |
Origin: | Secondary publication service |
Abstract: | Commonly in reinforcement learning (RL), rewards are discounted over time using an exponential function to model time preference, thereby bounding the expected long-term reward. In contrast, in economics and psychology, it has been shown that humans often adopt a hyperbolic discounting scheme, which is optimal when a specific task termination time distribution is assumed. In this work, we propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions. This formulation covers the case in which there is a non-exponential random termination time. We derive a Hamilton–Jacobi–Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved using a collocation method, which uses deep learning for function approximation. Further, we show how the inverse RL problem can be approached, in which one tries to recover properties of the discount function given decision data. We validate the applicability of our proposed approach on two simulated problems. Our approach opens the way for the analysis of human discounting in sequential decision-making tasks. |
Status: | Publisher's Version |
URN: | urn:nbn:de:tuda-tuprints-289347 |
Additional Information: | Further supplements are available under "Identisches Werk" (identical work) |
Classification DDC: | 500 Science and mathematics > 570 Life sciences, biology; 600 Technology, medicine, applied sciences > 621.3 Electrical engineering, electronics |
Divisions: | 18 Department of Electrical Engineering and Information Technology > Institute for Telecommunications > Bioinspired Communication Systems; 18 Department of Electrical Engineering and Information Technology > Self-Organizing Systems Lab; Zentrale Einrichtungen > Centre for Cognitive Science (CCS) |
Date Deposited: | 15 Jan 2025 09:15 |
Last Modified: | 15 Jan 2025 09:15 |
URI: | https://tuprints.ulb.tu-darmstadt.de/id/eprint/28934 |
PPN: | |
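As a brief illustration of the abstract's remark that hyperbolic discounting is optimal when a specific task termination time distribution is assumed, the following worked equations sketch the standard argument (discounting as a survival function, and a Gamma-distributed hazard rate yielding a hyperbolic discount). This is textbook material under the stated assumptions, not quoted from the paper.

```latex
% Sketch (standard result, not taken from the paper): the survival function of a
% random termination time acts as the discount function, and marginalizing an
% exponential discount over a Gamma-distributed hazard rate gives a hyperbolic discount.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
For a random termination time $T$ with survival function $S(t) = \Pr(T > t)$ and
reward rate $r(t)$, Fubini's theorem gives
\[
  \mathbb{E}_T\!\left[\int_0^T r(t)\,\mathrm{d}t\right]
  = \int_0^\infty \Pr(T > t)\, r(t)\,\mathrm{d}t
  = \int_0^\infty S(t)\, r(t)\,\mathrm{d}t ,
\]
so the survival function plays the role of the discount function.
A known, constant hazard rate $\lambda$ yields $S(t) = e^{-\lambda t}$
(exponential discounting). If instead $\lambda$ is uncertain with a
$\mathrm{Gamma}(\alpha,\beta)$ prior (shape $\alpha$, rate $\beta$), then
\[
  S(t) = \mathbb{E}_\lambda\!\left[e^{-\lambda t}\right]
       = \left(1 + \tfrac{t}{\beta}\right)^{-\alpha},
\]
a hyperbolic discount function.
\end{document}
```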