A Fast Adaptive Online Gradient Descent Algorithm in Over-Parameterized Neural Networks
| Published in | Neural Processing Letters, Vol. 55, no. 4, pp. 4641-4659 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | New York: Springer US, 01.08.2023 (Springer Nature B.V) |
| ISSN | 1370-4621 1573-773X |
| DOI | 10.1007/s11063-022-11057-4 |
| Summary: | In recent years, deep learning has dramatically improved the state of the art in many practical applications. However, this utility is highly dependent on the fine-tuning of hyperparameters, including the learning rate, batch size, and network initialization. Although many first-order adaptive gradient algorithms (e.g., Adam, AdaGrad) have been proposed to adjust the learning rate, they are vulnerable to the initial learning rate and the network structure when training over-parameterized models, especially in the dynamic online setting. Therefore, the main challenge of using deep learning in practice is how to reduce the cost of tuning hyperparameters. To address this problem, we integrate the adaptive strategy of Radhakrishnan et al. and the acceleration strategy of Ghadimi et al. to propose a fast adaptive online gradient algorithm, FAOGD. The adaptive strategy we adopt adjusts the learning rate using only the historical gradients and the training loss value, while the acceleration strategy is heavy-ball momentum, used to accelerate the training of deep models. The proposed FAOGD has the merit that no hyperparameters related to the learning rate need to be tuned, which saves much unnecessary computational overhead. It is also shown that FAOGD attains a regret bound of O(√T), matching Adam and AdaGrad with empirically tuned learning rates. Simulation results on over-parameterized neural networks clearly show that FAOGD outperforms existing algorithms. Furthermore, FAOGD is also robust to the network structure and batch size. |
|---|---|
| ISSN: | 1370-4621 1573-773X |
| DOI: | 10.1007/s11063-022-11057-4 |
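To make the update described in the summary concrete, here is a minimal Python sketch of a heavy-ball momentum step whose step size is driven only by the current loss value and the accumulated squared gradient norms. The specific step-size formula is a hypothetical stand-in for the adaptive rule of Radhakrishnan et al. used in FAOGD, not the paper's exact update, and all function and variable names are illustrative.

```python
import numpy as np

def faogd_like_step(x, v, grad, loss, grad_norm_sq_sum, beta=0.9, eps=1e-8):
    """One heavy-ball update with a loss/history-driven step size.

    NOTE: illustrative sketch only. The step size below (current loss divided
    by the accumulated squared gradient norms) is a stand-in for the adaptive
    rule of Radhakrishnan et al.; the published FAOGD formula may differ.
    """
    grad_norm_sq_sum += np.dot(grad, grad)          # historical gradient information
    eta = loss / (grad_norm_sq_sum + eps)           # step size from loss value + history
    v = beta * v - eta * grad                       # heavy-ball momentum (Ghadimi et al. style)
    x = x + v
    return x, v, grad_norm_sq_sum

# Toy online least-squares stream to exercise the update.
rng = np.random.default_rng(0)
d = 10
x, v, acc = np.zeros(d), np.zeros(d), 0.0
for t in range(1000):
    a = rng.normal(size=d)
    y = a @ np.ones(d)                  # "true" model is the all-ones vector
    loss = 0.5 * (a @ x - y) ** 2       # per-round loss f_t(x_t)
    grad = (a @ x - y) * a              # gradient of f_t at x_t
    x, v, acc = faogd_like_step(x, v, grad, loss, acc)
```

Apart from the momentum coefficient `beta`, no learning-rate hyperparameter appears in this sketch, which mirrors the tuning-free property that the summary attributes to FAOGD.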