TU Darmstadt / ULB / TUprints

Reinforcement Learning with Non-Exponential Discounting

Schultheis, Matthias ; Rothkopf, Constantin A. ; Koeppl, Heinz
eds.: Koyejo, S. ; Mohamed, S. ; Agarwal, A. ; Belgrave, D. ; Cho, K. ; Oh, A. (2025)
Reinforcement Learning with Non-Exponential Discounting.
The Thirty-Sixth Annual Conference on Neural Information Processing Systems. New Orleans ; Virtual Conference (28.11.2022 - 09.12.2022)
doi: 10.26083/tuprints-00028934
Conference or Workshop Item, Secondary publication, Publisher's Version

Text
NeurIPS-2022-reinforcement-learning-with-non-exponential-discounting-Paper-Conference.pdf
Copyright Information: CC BY 4.0 International - Creative Commons, Attribution.

Download (2MB)
Text (Supplement)
appendix.pdf
Copyright Information: CC BY 4.0 International - Creative Commons, Attribution.

Download (909kB)
Item Type: Conference or Workshop Item
Type of entry: Secondary publication
Title: Reinforcement Learning with Non-Exponential Discounting
Language: English
Date: 15 January 2025
Place of Publication: Darmstadt
Year of primary publication: 2022
Place of primary publication: San Diego, CA
Publisher: NeurIPS
Book Title: Advances in Neural Information Processing Systems 35 (NeurIPS 2022)
Collation: 14 pages
Event Title: The Thirty-Sixth Annual Conference on Neural Information Processing Systems
Event Location: New Orleans ; Virtual Conference
Event Dates: 28.11.2022 - 09.12.2022
DOI: 10.26083/tuprints-00028934
Corresponding Links:
Origin: Secondary publication service
Abstract:

Commonly in reinforcement learning (RL), rewards are discounted over time using an exponential function to model time preference, thereby bounding the expected long-term reward. In contrast, in economics and psychology, it has been shown that humans often adopt a hyperbolic discounting scheme, which is optimal when a specific task termination time distribution is assumed. In this work, we propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions. This formulation covers the case in which there is a non-exponential random termination time. We derive a Hamilton–Jacobi–Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved using a collocation method, which uses deep learning for function approximation. Further, we show how the inverse RL problem can be approached, in which one tries to recover properties of the discount function given decision data. We validate the applicability of our proposed approach on two simulated problems. Our approach opens the way for the analysis of human discounting in sequential decision-making tasks.
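As a brief illustration of the discounting schemes mentioned in the abstract (this is a minimal sketch, not code from the paper), the Python snippet below treats a discount function as the survival function of a random termination time: exponential discounting corresponds to a constant termination hazard, while marginalizing that hazard over a Gamma prior yields the hyperbolic form 1/(1 + k·t). The function names and parameter values are illustrative assumptions.

import numpy as np

def exponential_discount(t, rate):
    """d(t) = exp(-rate * t): survival function of a constant-hazard termination time."""
    return np.exp(-rate * t)

def hyperbolic_discount(t, k):
    """d(t) = 1 / (1 + k * t): hyperbolic discounting as used in behavioral models."""
    return 1.0 / (1.0 + k * t)

def marginalized_discount(t, shape, rate, n_samples=100_000, seed=0):
    """Monte-Carlo estimate of E_lambda[exp(-lambda * t)] with lambda ~ Gamma(shape, rate).
    For shape = 1 this reproduces the hyperbolic form 1 / (1 + t / rate)."""
    rng = np.random.default_rng(seed)
    lam = rng.gamma(shape, 1.0 / rate, size=n_samples)  # NumPy parameterizes by scale = 1 / rate
    return np.exp(-np.outer(t, lam)).mean(axis=1)

t = np.linspace(0.0, 10.0, 6)
print("exponential :", np.round(exponential_discount(t, rate=0.5), 3))
print("hyperbolic  :", np.round(hyperbolic_discount(t, k=0.5), 3))
print("marginalized:", np.round(marginalized_discount(t, shape=1.0, rate=2.0), 3))
# With shape=1 and rate=2, the marginalized curve matches 1 / (1 + 0.5 * t),
# i.e. hyperbolic discounting with k = 0.5.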

Status: Publisher's Version
URN: urn:nbn:de:tuda-tuprints-289347
Additional Information:

More supplements available under "Identical work"

Classification DDC: 500 Science and mathematics > 570 Life sciences, biology
600 Technology, medicine, applied sciences > 621.3 Electrical engineering, electronics
Divisions: 18 Department of Electrical Engineering and Information Technology > Institute for Telecommunications > Bioinspired Communication Systems
18 Department of Electrical Engineering and Information Technology > Self-Organizing Systems Lab
Zentrale Einrichtungen > Centre for Cognitive Science (CCS)
Date Deposited: 15 Jan 2025 09:15
Last Modified: 15 Jan 2025 09:15
URI: https://tuprints.ulb.tu-darmstadt.de/id/eprint/28934