A distributional code for value in dopamine-based reinforcement learning


Bibliographic Details
Published in Nature (London), Vol. 577, No. 7792, pp. 671–675
Main Authors Dabney, Will; Kurth-Nelson, Zeb; Uchida, Naoshige; Starkweather, Clara Kwon; Hassabis, Demis; Munos, Rémi; Botvinick, Matthew
Format Journal Article
Language English
Published London: Nature Publishing Group UK, 30.01.2020
ISSN 0028-0836, 1476-4687
DOI 10.1038/s41586-019-1924-6

More Information
Summary: Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain1–3. According to the now canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning4–6. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning. Analyses of single-cell recordings from mouse ventral tegmental area are consistent with a model of reinforcement learning in which the brain represents possible future rewards not as a single mean of stochastic outcomes, as in the canonical model, but instead as a probability distribution.
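To make the contrast drawn in the summary concrete, the sketch below compares a classical temporal-difference learner, which converges to the mean of a stochastic reward, with a population of predictors that scale positive and negative prediction errors asymmetrically and therefore settle at different expectiles of the reward distribution, in the spirit of the distributional account described above. This is a minimal Python sketch under simplifying assumptions: the two-outcome reward, the asymmetry parameters taus, and the learning rate base_lr are illustrative choices, not quantities taken from the paper or its data.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stochastic reward: equal chance of a small or large outcome.
def sample_reward():
    return rng.choice([0.1, 1.0])

n_trials = 20_000

# Classical TD-style learning: a single scalar prediction driven by a
# symmetric prediction error, converging to the expected (mean) reward.
v = 0.0
alpha = 0.02
for _ in range(n_trials):
    delta = sample_reward() - v      # reward prediction error
    v += alpha * delta               # symmetric update -> expectation

# Distributional variant: a population of predictors, each scaling positive
# and negative prediction errors differently (asymmetry tau). The population
# converges to a range of expectiles rather than a single mean.
taus = np.linspace(0.1, 0.9, 9)      # per-unit asymmetry, tau = a+/(a+ + a-)
values = np.zeros_like(taus)
base_lr = 0.02
for _ in range(n_trials):
    r = sample_reward()
    delta = r - values                           # per-unit prediction errors
    lr = np.where(delta > 0, taus, 1.0 - taus) * base_lr
    values += lr * delta

print(f"classical TD estimate (mean): {v:.3f}")
print("distributional estimates (expectiles):", np.round(values, 3))

In this toy setting, units with large tau are pulled more strongly by better-than-expected outcomes and settle above the mean, while units with small tau settle below it; together the population implicitly encodes the shape of the reward distribution rather than only its expectation, which is the kind of population-level signature the paper looks for in dopamine recordings.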
Author Contributions
W.D. conceived the project. W.D., Z.K., and M.B. contributed ideas for experiments and analysis. W.D. and Z.K. performed simulation experiments and analysis. N.U. and C.S. provided neuronal data for analysis. W.D., Z.K., and M.B. managed the project. M.B., N.U., R.M., and D.H. advised on the project. M.B., W.D., and Z.K. wrote the paper. W.D., Z.K., M.B., N.U., C.S., D.H., and R.M. provided revisions to the paper.
Equal contributions