A Fast Adaptive Online Gradient Descent Algorithm in Over-Parameterized Neural Networks

Bibliographic Details
Published in: Neural Processing Letters, Vol. 55, No. 4, pp. 4641-4659
Main Authors: Yang, Anni; Li, Dequan; Li, Guangxiang
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.08.2023
ISSN: 1370-4621, 1573-773X
DOI: 10.1007/s11063-022-11057-4

Summary: In recent years, deep learning has dramatically improved the state of the art in many practical applications. However, this success depends heavily on the fine-tuning of hyperparameters, including the learning rate, batch size, and network initialization. Although many first-order adaptive gradient algorithms (e.g., Adam, AdaGrad) have been proposed to adjust the learning rate, they remain sensitive to the initial learning rate and the network structure when training over-parameterized models, especially in the dynamic online setting. The main challenge of using deep learning in practice is therefore how to reduce the cost of tuning hyperparameters. To address this problem, we integrate the adaptive strategy of Radhakrishnan et al. with the acceleration strategy of Ghadimi et al. to propose a fast adaptive online gradient descent algorithm, FAOGD. The adaptive strategy we adopt adjusts the learning rate using only the historical gradients and the training loss value, while the acceleration strategy is heavy-ball momentum, which speeds up the training of deep models. FAOGD has the merit that no hyperparameters related to the learning rate need to be tuned, which saves much unnecessary computational overhead. We also show that FAOGD achieves a regret bound of O(√T), matching Adam and AdaGrad with empirically tuned learning rates. Simulation results on over-parameterized neural networks clearly show that FAOGD outperforms existing algorithms. Furthermore, FAOGD is robust to the network structure and batch size.
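
The abstract describes FAOGD only at a high level. As a rough illustration of the two ingredients it names, the sketch below pairs a Polyak-style step size, computed from the current training loss and gradient under the assumption that an over-parameterized model can drive the loss to zero (one possible reading of the adaptive strategy attributed to Radhakrishnan et al.), with heavy-ball momentum (the acceleration strategy attributed to Ghadimi et al.). The function name faogd_style_step, the momentum and damping constants, and the toy problem are assumptions made for this sketch, not the paper's specification.

# Minimal, illustrative sketch (not the authors' reference implementation): one
# update combining (i) an adaptive step size computed from the current training
# loss and gradient, assuming the attainable minimum loss is zero, and
# (ii) heavy-ball momentum.
import numpy as np

def faogd_style_step(w, v, grad, loss, beta=0.5, damping=0.5, eps=1e-8):
    # Adaptive step size from the loss value and gradient: no hand-tuned base
    # learning rate (the loss is measured against an assumed optimum of zero;
    # beta and damping are illustrative constants, not values from the paper).
    eta = damping * loss / (np.dot(grad, grad) + eps)
    # Heavy-ball momentum: reuse a fraction of the previous displacement.
    v_new = -eta * grad + beta * v
    return w + v_new, v_new

# Toy usage: an over-parameterized (under-determined) least-squares problem,
# where the minimum loss is exactly zero, so the step size above is well defined.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))
b = rng.normal(size=5)
w, v = np.zeros(20), np.zeros(20)
for _ in range(300):
    residual = A @ w - b
    loss = 0.5 * residual @ residual       # 0.5 * ||Aw - b||^2
    grad = A.T @ residual                  # gradient of the loss at w
    w, v = faogd_style_step(w, v, grad, loss)
print("final loss:", 0.5 * np.linalg.norm(A @ w - b) ** 2)  # should approach zero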