Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions
| Published in | Mathematical Programming Vol. 161; no. 1-2; pp. 419-449 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | Berlin/Heidelberg: Springer Berlin Heidelberg, 01.01.2017 (Springer Nature B.V.) |
| ISSN | 0025-5610, 1436-4646 |
| DOI | 10.1007/s10107-016-1017-3 |
| Summary: | Classical stochastic gradient methods are well suited for minimizing expected-value objective functions. However, they do not apply to the minimization of a nonlinear function of an expected value, or of a composition of two expected-value functions, i.e., the problem $\min_x \, \mathbb{E}_v f_v\big(\mathbb{E}_w[g_w(x)]\big)$. To solve this stochastic composition problem, we propose a class of stochastic compositional gradient descent (SCGD) algorithms, which can be viewed as stochastic versions of the quasi-gradient method. SCGD updates the solution using noisy sample gradients of $f_v$ and $g_w$, and uses an auxiliary variable to track the unknown quantity $\mathbb{E}_w[g_w(x)]$. We prove that SCGD converges almost surely to an optimal solution for convex optimization problems, as long as such a solution exists. The convergence involves the interplay of two iterations running on different time scales. For nonsmooth convex problems, SCGD achieves a convergence rate of $O(k^{-1/4})$ in the general case and $O(k^{-2/3})$ in the strongly convex case, after taking $k$ samples. For smooth convex problems, SCGD can be accelerated to converge at a rate of $O(k^{-2/7})$ in the general case and $O(k^{-4/5})$ in the strongly convex case. For nonconvex problems, we prove that any limit point generated by SCGD is a stationary point, for which we also provide a convergence rate analysis. The stochastic setting in which one wants to optimize compositions of expected-value functions is very common in practice; the proposed SCGD methods find wide applications in learning, estimation, dynamic programming, etc. |
|---|---|
| ISSN: | 0025-5610, 1436-4646 |
| DOI: | 10.1007/s10107-016-1017-3 |
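
A plain stochastic gradient step does not apply to this problem because the compositional gradient $\nabla F(x)=\mathbb{E}_w[\nabla g_w(x)]^\top \nabla f\big(\mathbb{E}_w[g_w(x)]\big)$ has an expectation inside the nonlinear map $f$, so a single sample of $g_w$ gives a biased gradient estimate. The sketch below is a minimal illustration of the two-time-scale SCGD update described in the abstract on a toy quadratic composition: a fast running average $y_k$ tracks $\mathbb{E}_w[g_w(x_k)]$ while the solution $x_k$ moves with a smaller step. The problem data, function names, and step-size constants here are illustrative assumptions, not the paper's experiments; consult the paper for the exact step-size rules behind each stated rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of min_x E_v[ f_v( E_w[ g_w(x) ] ) ]  (all names illustrative):
#   g_w(x) = A x + w with E[w] = 0, so E_w[g_w(x)] = A x
#   f_v(y) = 0.5 * ||y - v||^2 with E[v] = b, so the objective equals
#   0.5 * ||A x - b||^2 + const, minimized by any x with A x = b.
d, m = 5, 3
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

def sample_g(x):
    """Return one noisy sample of g_w(x) and its Jacobian."""
    w = 0.1 * rng.standard_normal(m)
    return A @ x + w, A          # Jacobian of g_w(x) = A x + w is A

def sample_grad_f(y):
    """Return one noisy sample of the gradient of f_v at y."""
    v = b + 0.1 * rng.standard_normal(m)
    return y - v                 # gradient of 0.5 * ||y - v||^2

x = np.zeros(d)                  # decision variable (slow iterate)
y = np.zeros(m)                  # auxiliary variable tracking E_w[g_w(x)] (fast iterate)

for k in range(1, 20001):
    alpha = 0.5 * k ** -0.75     # smaller step for x (slow time scale); assumed order
    beta = 0.5 * k ** -0.5       # larger weight for y (fast time scale); assumed order
    g_val, g_jac = sample_g(x)
    y = (1.0 - beta) * y + beta * g_val              # running estimate of E_w[g_w(x)]
    x = x - alpha * (g_jac.T @ sample_grad_f(y))     # compositional gradient step

print("residual ||A x - b|| =", np.linalg.norm(A @ x - b))
```

The key design choice the abstract points to is visible in the two step sizes: because $\beta_k$ decays more slowly than $\alpha_k$, the auxiliary average $y_k$ adapts quickly enough to stand in for the unknown inner expectation, which is what removes the bias that would break an ordinary single-sample gradient step.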