WhisperNet: Deep Siamese Network For Emotion and Speech Tempo Invariant Visual-Only Lip-Based Biometric

Bibliographic Details
Published in	2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS) pp. 1 - 5
Main Authors	Zakeri, Abdollah, Hassanpour, Hamid
Format	Conference Proceeding
Language	English
Published	IEEE 29.12.2021
Subjects	Authentication Biometrics Biometrics (access control) Deep learning Deep Siamese Network Face recognition Lip-Based Biometrics Privacy Signal processing Video Processing Visualization
DOI	10.1109/ICSPIS54653.2021.9729394

Abstract	In the past decade, deep learning has revolutionized the field of biometrics. Many improvements to older biometric methods have reduced security concerns. Before biometric verification methods such as facial recognition, an impostor could access a person's vital information simply by learning their password through a key-logger installed on their system. Thanks to deep learning, safer biometric approaches to person verification and person re-identification, such as visual and audio-visual authentication, have become feasible on many devices, including smartphones and laptops. Unfortunately, some people consider facial recognition a threat to personal privacy. Additionally, biometric methods that use the audio modality are not always applicable, for example when the environment is noisy. Lip-based biometric authentication (LBBA) is the process of authenticating a person using a video of their lip movements while talking. A visual-only LBBA method can address these concerns about other biometric authentication methods. Because a person's emotional state can affect their utterance and speech tempo, a visual-only LBBA method must produce an emotion- and speech-tempo-invariant embedding of the input utterance video. In this article, we propose a network inspired by the Siamese architecture that learns to produce emotion- and speech-tempo-invariant representations of the input utterance videos. To train and test the proposed network, we used the CREMA-D dataset and achieved 95.41% accuracy on the validation set.
Author_xml	– sequence: 1 givenname: Abdollah surname: Zakeri fullname: Zakeri, Abdollah email: a.zakeri@shahroodut.ac.ir organization: Shahrood University of Technology, Faculty of Computer Engineering – sequence: 2 givenname: Hamid surname: Hassanpour fullname: Hassanpour, Hamid email: h.hassanpour@shahroodut.ac.ir organization: Shahrood University of Technology, Faculty of Computer Engineering
EISBN	9781665409384 166540938X
EndPage	5
ExternalDocumentID	9729394
Genre	orig-research
PageCount	5
ParticipantIDs	ieee_primary_9729394
PublicationCentury	2000
PublicationDate	2021-Dec.-29
PublicationDateYYYYMMDD	2021-12-29
PublicationDate_xml	– month: 12 year: 2021 text: 2021-Dec.-29 day: 29
PublicationDecade	2020
PublicationTitle	2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS)
PublicationTitleAbbrev	ICSPIS
PublicationYear	2021
Publisher	IEEE
Publisher_xml	– name: IEEE
StartPage	1
Title	WhisperNet: Deep Siamese Network For Emotion and Speech Tempo Invariant Visual-Only Lip-Based Biometric
URI	https://ieeexplore.ieee.org/document/9729394