Reliable adaptive distributed hyperparameter optimization (RadHPO) for deep learning training and uncertainty estimation
| Published in | The Journal of Supercomputing, Vol. 79, No. 10, pp. 10677-10690 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | New York: Springer US, 01.07.2023 (Springer Nature B.V.) |
| ISSN | 0920-8542; 1573-0484 |
| DOI | 10.1007/s11227-023-05081-x | 
| Summary: | Training and validation of Neural Networks (NN) are very computationally intensive. In this paper, we propose a distributed-system-based NN infrastructure that achieves two goals: to accelerate model training, specifically for hyperparameter optimization, and to reuse some of the resulting intermediate models to evaluate the uncertainty of the model. By accelerating model training, we can obtain a large set of potential models and compare them in a shorter amount of time. Automating this process reduces development time and provides an easy way to compare different models. Our application runs different models on distinct servers with a single training data set, each with its own tweaked hyperparameters. By adding uncertainty to our results, our framework provides not just a single prediction but a distribution over predictions. Adding uncertainty is essential in some NN applications, since most models assume that the input data distributions of the test and validation sets are identical; when this assumption does not hold in practice, models can make catastrophic mistakes. Since our solution is a distributed system, we make our implementation robust to common distributed-system failures (servers going down, loss of communication among some nodes, and others). Furthermore, we use a gossip-style heartbeat protocol for failure detection and recovery. Finally, preliminary results using a black-box approach to generate the training models show that our infrastructure scales well on different hardware platforms. |
|---|---|
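
The summary describes the core mechanism only at a high level: one training job per hyperparameter setting, dispatched to distinct servers, with the resulting intermediate models reused as an ensemble that yields a distribution over predictions. The sketch below illustrates that idea under stated assumptions; it is not the authors' implementation, and the process pool (standing in for remote servers), the `MLPRegressor` stand-in network, and the hyperparameter grid are illustrative choices.

```python
# Minimal sketch (not the authors' code): each worker trains the same network
# on the same training set with a different hyperparameter setting, and the
# resulting models are kept as an ensemble whose spread of predictions gives
# an uncertainty estimate alongside the point prediction.
from concurrent.futures import ProcessPoolExecutor
from itertools import product
import numpy as np
from sklearn.neural_network import MLPRegressor  # stand-in for the real NN

def train_model(hparams, X_train, y_train):
    """Hypothetical per-server training job for one hyperparameter setting."""
    model = MLPRegressor(hidden_layer_sizes=hparams["hidden"],
                         learning_rate_init=hparams["lr"],
                         max_iter=500)
    return model.fit(X_train, y_train)

def distributed_sweep(grid, X_train, y_train, max_workers=4):
    """One training job per setting; the process pool stands in for servers."""
    settings = [dict(zip(grid, values)) for values in product(*grid.values())]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        jobs = [pool.submit(train_model, s, X_train, y_train) for s in settings]
        return [job.result() for job in jobs]

def predict_with_uncertainty(models, X):
    """Distribution over predictions: mean as estimate, std as uncertainty."""
    preds = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return preds.mean(axis=0), preds.std(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    grid = {"hidden": [(16,), (32, 32)], "lr": [1e-3, 1e-2]}
    models = distributed_sweep(grid, X, y)
    mean, std = predict_with_uncertainty(models, X[:5])
```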
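
The record also mentions a gossip-style heartbeat protocol for failure detection and recovery but gives no details. The following is a minimal, single-process sketch of how such a protocol typically works: each node increments its own heartbeat counter, periodically merges its table with a randomly chosen peer, and suspects any node whose counter stops advancing. Node names, the timeout, and the merge rule are assumptions, not the paper's actual parameters.

```python
# Gossip-style heartbeat table sketch (illustrative, not the paper's protocol).
import random
import time
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    peers: list = field(default_factory=list)
    # heartbeat table: node name -> (heartbeat counter, local time last updated)
    table: dict = field(default_factory=dict)

    def beat(self):
        """Increment this node's own heartbeat counter."""
        count, _ = self.table.get(self.name, (0, 0.0))
        self.table[self.name] = (count + 1, time.monotonic())

    def gossip(self):
        """Send the local table to one randomly chosen peer."""
        if self.peers:
            random.choice(self.peers).receive(self.table)

    def receive(self, remote_table):
        """Merge a peer's table, keeping the highest counter seen per node."""
        now = time.monotonic()
        for name, (count, _) in remote_table.items():
            local_count = self.table.get(name, (-1, 0.0))[0]
            if count > local_count:
                self.table[name] = (count, now)

    def suspected_failed(self, timeout=5.0):
        """Nodes whose counters have not advanced within `timeout` seconds."""
        now = time.monotonic()
        return [n for n, (_, seen) in self.table.items()
                if n != self.name and now - seen > timeout]

# Example wiring: three nodes that can gossip to each other.
# a, b, c = Node("a"), Node("b"), Node("c")
# a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
# a.beat(); a.gossip()   # one of b or c now knows a's latest heartbeat
```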