Risk-Sensitive Reinforcement Learning

Bibliographic Details
Published in	Neural Computation, Vol. 26, no. 7, pp. 1298-1328
Main Authors	Shen, Yun; Tobia, Michael J.; Sommer, Tobias; Obermayer, Klaus
Format	Journal Article
Language	English
Published	MIT Press (The MIT Press Journals), One Rogers Street, Cambridge, MA 02142-1209, USA, 01.07.2014
Subjects	Algorithms; Behavior; Brain - physiology; Brain Mapping; Decision making; Decision Making - physiology; Decision making models; Human behavior; Humans; Learning; Letters; Magnetic Resonance Imaging; Markov analysis; Markov Chains; Mathematical analysis; Mathematical models; Models, Psychological; Neuropsychology; Nonlinear Dynamics; Oxygen - blood; Probability; Reinforcement; Reinforcement (Psychology); Risk; Signal transduction; Tasks; Transition probabilities; Utilities; Utility functions
ISSN	0899-7667; 1530-888X
DOI	10.1162/NECO_a_00600

Summary:	We derive a family of risk-sensitive reinforcement learning methods for agents who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents’ behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subject's responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition, we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.
Bibliography:	July 2014
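The central idea in the summary, applying a utility function to the TD error inside a Q-learning update, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The piecewise-linear utility, its slopes `k_plus` and `k_minus`, the assumption that u(0) = 0, the constant learning rate, and the two-action bandit task are all hypothetical choices made for illustration; the article's own utility functions and step-size conditions may differ.

```python
import random


def utility(x, k_plus=1.0, k_minus=2.0):
    """Illustrative piecewise-linear utility with u(0) = 0 (an assumption).

    Negative TD errors are scaled by k_minus and positive ones by k_plus,
    so k_minus > k_plus makes the learner weigh disappointments more heavily.
    """
    return k_plus * x if x >= 0 else k_minus * x


def risk_sensitive_q_update(q, state, action, reward, next_values,
                            alpha=0.002, gamma=0.9, u=utility):
    """Apply one Q-learning update with the utility applied to the TD error.

    q is a dict mapping state -> dict of action -> value. next_values holds
    the Q-values of the successor state; pass an empty list if the episode
    ends after this step.
    """
    td_error = reward + gamma * max(next_values, default=0.0) - q[state][action]
    q[state][action] += alpha * u(td_error)
    return td_error


def run_bandit(num_steps=20000, seed=0):
    """Hypothetical one-step task: a safe action pays 0, a risky one pays +1 or -1."""
    rng = random.Random(seed)
    q = {"start": {"safe": 0.0, "risky": 0.0}}
    for _ in range(num_steps):
        risky_reward = rng.choice([1.0, -1.0])
        risk_sensitive_q_update(q, "start", "safe", 0.0, [])
        risk_sensitive_q_update(q, "start", "risky", risky_reward, [])
    return q["start"]


if __name__ == "__main__":
    print(run_bandit())
```

Under these assumptions the risky action has an expected reward of zero, yet its learned value settles below zero: the update stops drifting where the expected utility of the TD error is zero, which for this utility is at about -1/3. The agent therefore prefers the safe action, a simple instance of the different treatment of gains and losses that the summary describes.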