A multi-tasking model of speaker-keyword classification for keeping human in the loop of drone-assisted inspection
| Published in | Engineering Applications of Artificial Intelligence, Vol. 117, p. 105597 |
|---|---|
| Main Authors | , , , , , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier Ltd, 01.01.2023 |
| ISSN | 0952-1976, 1873-6769 |
| DOI | 10.1016/j.engappai.2022.105597 |
| Summary: | Audio commands are a preferred communication medium to keep inspectors in the loop of civil infrastructure inspection performed by a semi-autonomous drone. To understand job-specific commands from a group of heterogeneous and dynamic inspectors, a model must be developed cost-effectively for the group and easily adapted when the group changes. This paper is motivated to build a multi-tasking deep learning model that possesses a Share–Split–Collaborate architecture. This architecture allows the two classification tasks to share the feature extractor and then split the subject-specific and keyword-specific features intertwined in the extracted features through feature projection and collaborative training. A base model for a group of five authorized subjects is trained and tested on the inspection keyword dataset collected by this study. The model achieved a mean accuracy of 95.3% or higher in classifying the keywords of any authorized inspector. Its mean accuracy in speaker classification is 99.2%. Due to the richer keyword representations that the model learns from the pooled training data, adapting the base model to a new inspector requires only a little training data from that inspector, such as five utterances per keyword. Using the speaker classification scores for inspector verification achieves a success rate of at least 93.9% in verifying authorized inspectors and 76.1% in detecting unauthorized ones. Further, the paper demonstrates the applicability of the proposed model to larger groups on a public dataset. This paper provides a solution to the challenges facing AI-assisted human–robot interaction, including worker heterogeneity, worker dynamics, and job heterogeneity. |
|---|---|
| Highlights: | •The Share–Split–Collaborate multitask learning architecture is suitable for speaker-keyword classification. •Subject-specific and phonetic-specific features intertwined in audio data can be disentangled. •Rich keyword representations are learned from multi-subject spoken command data. •A small amount of data from new speakers is sufficient for adding new classes to the speaker classifier. •Speaker classification scores are also effective for speaker verification. |
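As a rough illustration of the Share–Split–Collaborate idea described in the summary, the sketch below shows a forward pass in which a shared feature extractor feeds two projection matrices that split the representation into a speaker-specific and a keyword-specific subspace, each with its own classification head. All layer choices, dimensions, and weights here are illustrative assumptions for a five-speaker, ten-keyword setup, not the paper's actual implementation or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative dimensions (assumptions, not from the paper)
d_in, d_shared, d_split = 40, 64, 32   # input features, shared width, split width
n_speakers, n_keywords = 5, 10

# Shared feature extractor (a single dense layer as a stand-in)
W_shared = rng.standard_normal((d_in, d_shared)) * 0.1

# Split step: two projections carve the shared features into
# a speaker-specific and a keyword-specific subspace
P_speaker = rng.standard_normal((d_shared, d_split)) * 0.1
P_keyword = rng.standard_normal((d_shared, d_split)) * 0.1

# One classification head per task
W_speaker = rng.standard_normal((d_split, n_speakers)) * 0.1
W_keyword = rng.standard_normal((d_split, n_keywords)) * 0.1

def forward(x):
    """Return (speaker probabilities, keyword probabilities) for a batch x."""
    h = relu(x @ W_shared)                 # shared features
    z_speaker = relu(h @ P_speaker)        # speaker-specific subspace
    z_keyword = relu(h @ P_keyword)        # keyword-specific subspace
    return softmax(z_speaker @ W_speaker), softmax(z_keyword @ W_keyword)

x = rng.standard_normal((3, d_in))         # batch of 3 acoustic feature vectors
p_speaker, p_keyword = forward(x)
print(p_speaker.shape, p_keyword.shape)
```

In this framing, the speaker verification use mentioned in the summary would amount to thresholding the maximum speaker-class score: an utterance whose top score falls below a tuned threshold is flagged as coming from an unauthorized speaker.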