Reliable adaptive distributed hyperparameter optimization (RadHPO) for deep learning training and uncertainty estimation
| Published in | The Journal of Supercomputing, Vol. 79, No. 10, pp. 10677-10690 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | New York: Springer US, 01.07.2023 (Springer Nature B.V.) |
| ISSN | 0920-8542; 1573-0484 |
| DOI | 10.1007/s11227-023-05081-x | 
| Summary: | Training and validation of Neural Networks (NN) are very computationally intensive. In this paper, we propose a distributed-system-based NN infrastructure that achieves two goals: to accelerate model training, specifically for hyperparameter optimization, and to reuse some of the resulting intermediate models to evaluate the uncertainty of the model. By accelerating model training, we can obtain a large set of potential models and compare them in a shorter amount of time. Automating this process reduces development time and provides an easy way to compare different models. Our application runs different models on distinct servers with a single training data set, each with its own tweaked hyperparameters. By adding uncertainty to our results, our framework provides not just a single prediction but a distribution over predictions. Adding uncertainty is essential in some NN applications, since most models assume that the input data distributions of the test and validation sets are identical; when this assumption does not hold in practice, models can make catastrophic mistakes. Since our solution is a distributed system, we make our implementation robust to common distributed-system failures (servers going down, loss of communication among some nodes, and others). Furthermore, we use a gossip-style heartbeat protocol for failure detection and recovery. Finally, preliminary results using a black-box approach to generate the training models show that our infrastructure scales well on different hardware platforms. |
|---|---|
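
The summary describes the core mechanism only at a high level: one training job per hyperparameter setting, dispatched to distinct servers, with the resulting intermediate models reused as an ensemble that yields a distribution over predictions. The sketch below illustrates that idea under stated assumptions; it is not the authors' implementation, and the process pool (standing in for remote servers), the `MLPRegressor` stand-in network, and the hyperparameter grid are illustrative choices.

```python
# Minimal sketch (not the authors' code): each worker trains the same network
# on the same training set with a different hyperparameter setting, and the
# resulting models are kept as an ensemble whose spread of predictions gives
# an uncertainty estimate alongside the point prediction.
from concurrent.futures import ProcessPoolExecutor
from itertools import product
import numpy as np
from sklearn.neural_network import MLPRegressor  # stand-in for the real NN

def train_model(hparams, X_train, y_train):
    """Hypothetical per-server training job for one hyperparameter setting."""
    model = MLPRegressor(hidden_layer_sizes=hparams["hidden"],
                         learning_rate_init=hparams["lr"],
                         max_iter=500)
    return model.fit(X_train, y_train)

def distributed_sweep(grid, X_train, y_train, max_workers=4):
    """One training job per setting; the process pool stands in for servers."""
    settings = [dict(zip(grid, values)) for values in product(*grid.values())]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        jobs = [pool.submit(train_model, s, X_train, y_train) for s in settings]
        return [job.result() for job in jobs]

def predict_with_uncertainty(models, X):
    """Distribution over predictions: mean as estimate, std as uncertainty."""
    preds = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return preds.mean(axis=0), preds.std(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    grid = {"hidden": [(16,), (32, 32)], "lr": [1e-3, 1e-2]}
    models = distributed_sweep(grid, X, y)
    mean, std = predict_with_uncertainty(models, X[:5])
```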
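
The record also mentions a gossip-style heartbeat protocol for failure detection and recovery but gives no details. The following is a minimal, single-process sketch of how such a protocol typically works: each node increments its own heartbeat counter, periodically merges its table with a randomly chosen peer, and suspects any node whose counter stops advancing. Node names, the timeout, and the merge rule are assumptions, not the paper's actual parameters.

```python
# Gossip-style heartbeat table sketch (illustrative, not the paper's protocol).
import random
import time
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    peers: list = field(default_factory=list)
    # heartbeat table: node name -> (heartbeat counter, local time last updated)
    table: dict = field(default_factory=dict)

    def beat(self):
        """Increment this node's own heartbeat counter."""
        count, _ = self.table.get(self.name, (0, 0.0))
        self.table[self.name] = (count + 1, time.monotonic())

    def gossip(self):
        """Send the local table to one randomly chosen peer."""
        if self.peers:
            random.choice(self.peers).receive(self.table)

    def receive(self, remote_table):
        """Merge a peer's table, keeping the highest counter seen per node."""
        now = time.monotonic()
        for name, (count, _) in remote_table.items():
            local_count = self.table.get(name, (-1, 0.0))[0]
            if count > local_count:
                self.table[name] = (count, now)

    def suspected_failed(self, timeout=5.0):
        """Nodes whose counters have not advanced within `timeout` seconds."""
        now = time.monotonic()
        return [n for n, (_, seen) in self.table.items()
                if n != self.name and now - seen > timeout]

# Example wiring: three nodes that can gossip to each other.
# a, b, c = Node("a"), Node("b"), Node("c")
# a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
# a.beat(); a.gossip()   # one of b or c now knows a's latest heartbeat
```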