Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions

Bibliographic Details
Published in: Mathematical Programming, Vol. 161, No. 1-2, pp. 419-449
Main Authors: Wang, Mengdi; Fang, Ethan X.; Liu, Han
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg, 01.01.2017 (Springer Nature B.V.)
ISSN: 0025-5610, 1436-4646
DOI: 10.1007/s10107-016-1017-3

Summary: Classical stochastic gradient methods are well suited for minimizing expected-value objective functions. However, they do not apply to the minimization of a nonlinear function of an expected value, or of a composition of two expected-value functions, i.e., the problem $\min_x \mathbb{E}_v\big[f_v\big(\mathbb{E}_w[g_w(x)]\big)\big]$. In order to solve this stochastic composition problem, we propose a class of stochastic compositional gradient descent (SCGD) algorithms that can be viewed as stochastic versions of the quasi-gradient method. SCGD updates the solution based on noisy sample gradients of $f_v$ and $g_w$, and uses an auxiliary variable to track the unknown quantity $\mathbb{E}_w[g_w(x)]$. We prove that SCGD converges almost surely to an optimal solution for convex optimization problems, as long as such a solution exists. The convergence involves the interplay of two iterations with different time scales. For nonsmooth convex problems, SCGD achieves a convergence rate of $O(k^{-1/4})$ in the general case and $O(k^{-2/3})$ in the strongly convex case, after taking $k$ samples. For smooth convex problems, SCGD can be accelerated to converge at a rate of $O(k^{-2/7})$ in the general case and $O(k^{-4/5})$ in the strongly convex case. For nonconvex problems, we prove that any limit point generated by SCGD is a stationary point, for which we also provide a convergence rate analysis. Indeed, the stochastic setting in which one wants to optimize compositions of expected-value functions is very common in practice, and the proposed SCGD methods find wide applications in learning, estimation, dynamic programming, etc.
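
To make the two-time-scale idea concrete, the following is a minimal sketch of a basic SCGD-style iteration on a toy quadratic composition. The specific choices of $g_w$, $f_v$, and the step-size schedules below are illustrative assumptions for this sketch, not the paper's exact formulation or parameters.

```python
import numpy as np

# Sketch of a basic SCGD-style update on a toy instance (assumed, for illustration):
#   g_w(x) = x + w (noisy identity map), f_v(y) = 0.5 * ||y - b + v||^2,
# so the composition E_v[f_v(E_w[g_w(x)])] is minimized at x* = b.
rng = np.random.default_rng(0)
d = 5
b = rng.normal(size=d)

def sample_g(x):
    """Return a noisy sample g_w(x) and its Jacobian (identity for this toy map)."""
    w = 0.1 * rng.normal(size=d)
    return x + w, np.eye(d)

def sample_grad_f(y):
    """Return a noisy sample of the gradient of f_v at y."""
    v = 0.1 * rng.normal(size=d)
    return y - b + v

x = np.zeros(d)   # decision variable x_k
y = np.zeros(d)   # auxiliary variable tracking E_w[g_w(x_k)]
for k in range(1, 20001):
    alpha = k ** -0.75   # slower step size for x (illustrative schedule)
    beta = k ** -0.5     # faster step size for the auxiliary tracking variable y
    g_val, g_jac = sample_g(x)
    y = (1 - beta) * y + beta * g_val           # running estimate of E_w[g_w(x)]
    x = x - alpha * g_jac.T @ sample_grad_f(y)  # quasi-gradient step using the estimate

print("distance to optimum:", np.linalg.norm(x - b))
```

The two step sizes decay at different rates, so the auxiliary variable y is updated on a faster time scale than x; this is the interplay of two time scales referred to in the summary.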