Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions

Bibliographic Details
Published in: Mathematical Programming, Vol. 161, No. 1-2, pp. 419-449
Main Authors: Wang, Mengdi; Fang, Ethan X.; Liu, Han
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg, 01.01.2017 (Springer Nature B.V.)
ISSN: 0025-5610, 1436-4646
DOI: 10.1007/s10107-016-1017-3

Summary: Classical stochastic gradient methods are well suited for minimizing expected-value objective functions. However, they do not apply to the minimization of a nonlinear function of an expected value, or of a composition of two expected-value functions, i.e., the problem $\min_x \mathbb{E}_v\big[f_v\big(\mathbb{E}_w[g_w(x)]\big)\big]$. In order to solve this stochastic composition problem, we propose a class of stochastic compositional gradient descent (SCGD) algorithms that can be viewed as stochastic versions of the quasi-gradient method. SCGD updates the solution based on noisy sample gradients of $f_v$ and $g_w$, and uses an auxiliary variable to track the unknown quantity $\mathbb{E}_w[g_w(x)]$. We prove that SCGD converges almost surely to an optimal solution for convex optimization problems, as long as such a solution exists. The convergence involves the interplay of two iterations with different time scales. For nonsmooth convex problems, SCGD achieves a convergence rate of $O(k^{-1/4})$ in the general case and $O(k^{-2/3})$ in the strongly convex case, after taking $k$ samples. For smooth convex problems, SCGD can be accelerated to converge at a rate of $O(k^{-2/7})$ in the general case and $O(k^{-4/5})$ in the strongly convex case. For nonconvex problems, we prove that any limit point generated by SCGD is a stationary point, for which we also provide a convergence rate analysis. Indeed, the stochastic setting in which one wants to optimize compositions of expected-value functions is very common in practice, and the proposed SCGD methods find wide applications in learning, estimation, dynamic programming, etc.
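
To make the two-time-scale idea concrete, the following is a minimal sketch of a basic SCGD-style iteration on a toy quadratic composition. The specific choices of $g_w$, $f_v$, and the step-size schedules below are illustrative assumptions for this sketch, not the paper's exact formulation or parameters.

```python
import numpy as np

# Sketch of a basic SCGD-style update on a toy instance (assumed, for illustration):
#   g_w(x) = x + w (noisy identity map), f_v(y) = 0.5 * ||y - b + v||^2,
# so the composition E_v[f_v(E_w[g_w(x)])] is minimized at x* = b.
rng = np.random.default_rng(0)
d = 5
b = rng.normal(size=d)

def sample_g(x):
    """Return a noisy sample g_w(x) and its Jacobian (identity for this toy map)."""
    w = 0.1 * rng.normal(size=d)
    return x + w, np.eye(d)

def sample_grad_f(y):
    """Return a noisy sample of the gradient of f_v at y."""
    v = 0.1 * rng.normal(size=d)
    return y - b + v

x = np.zeros(d)   # decision variable x_k
y = np.zeros(d)   # auxiliary variable tracking E_w[g_w(x_k)]
for k in range(1, 20001):
    alpha = k ** -0.75   # slower step size for x (illustrative schedule)
    beta = k ** -0.5     # faster step size for the auxiliary tracking variable y
    g_val, g_jac = sample_g(x)
    y = (1 - beta) * y + beta * g_val           # running estimate of E_w[g_w(x)]
    x = x - alpha * g_jac.T @ sample_grad_f(y)  # quasi-gradient step using the estimate

print("distance to optimum:", np.linalg.norm(x - b))
```

The two step sizes decay at different rates, so the auxiliary variable y is updated on a faster time scale than x; this is the interplay of two time scales referred to in the summary.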