Audio-visual speech enhancement using deep neural networks


Bibliographic Details
Published in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1-6
Main Authors Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Jen-Chun Lin, Yu Tsao, Hsiu-Wen Chang, Hsin-Min Wang
Format Conference Proceeding
Language English
Published Asia-Pacific Signal and Information Processing Association, 01.12.2016
DOI 10.1109/APSIPA.2016.7820732


Abstract This paper proposes a novel framework that integrates audio and visual information for speech enhancement. Most speech enhancement approaches consider only audio features when designing the filters or transfer functions that convert noisy speech signals into clean ones. Visual data, which provide useful complementary information to audio data, have been integrated with audio data in many speech-related approaches to attain more effective speech processing. This paper investigates the use of visual features of lip motion as additional input to improve the performance of deep neural network (DNN) based speech enhancement. The experimental results show that a DNN with audio-visual inputs outperforms a DNN with audio-only inputs on four standardized objective evaluation metrics, confirming the effectiveness of including visual information in an audio-only speech enhancement framework.
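The fusion described in the abstract can be illustrated as early, feature-level fusion: the audio feature frame and the synchronized lip-motion feature frame are concatenated into a single input vector for a regression DNN that predicts the clean audio frame. The sketch below is an illustrative assumption, not the authors' exact architecture; all dimensions, layer sizes, and names (e.g. `enhance_frame`) are hypothetical, and the weights are random stand-ins for what would be trained on paired noisy/clean data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper):
# 257 spectral bins per audio frame, 30 lip-region visual features.
AUDIO_DIM, VISUAL_DIM, HIDDEN, OUT_DIM = 257, 30, 128, 257

# Stand-in weights; in practice these would be trained with a
# regression objective (e.g. MSE against clean-speech frames).
W1 = rng.standard_normal((AUDIO_DIM + VISUAL_DIM, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, OUT_DIM)) * 0.01
b2 = np.zeros(OUT_DIM)

def enhance_frame(audio_feat, visual_feat):
    """Map one noisy audio frame plus its synchronized visual frame
    to an enhanced audio frame via a single-hidden-layer DNN."""
    x = np.concatenate([audio_feat, visual_feat])  # feature-level fusion
    h = np.maximum(0.0, x @ W1 + b1)               # ReLU hidden layer
    return h @ W2 + b2                             # linear regression output

noisy_frame = rng.standard_normal(AUDIO_DIM)
lip_frame = rng.standard_normal(VISUAL_DIM)
enhanced = enhance_frame(noisy_frame, lip_frame)
print(enhanced.shape)  # (257,)
```

The audio-only baseline the paper compares against would simply omit `visual_feat` from the concatenation, keeping the rest of the pipeline unchanged.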
Author Details
1. Jen-Cheng Hou (coolkiu@citi.sinica.edu.tw), Research Center for Information Technology Innovation, Taipei, Taiwan
2. Syu-Siang Wang, Research Center for Information Technology Innovation, Taipei, Taiwan
3. Ying-Hui Lai (yhlai@ee.yzu.edu.tw), Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
4. Jen-Chun Lin, Institute of Information Science, Taipei, Taiwan
5. Yu Tsao, Research Center for Information Technology Innovation, Taipei, Taiwan
6. Hsiu-Wen Chang (hsiuwen@mmc.edu.tw), Department of Audiology & Speech Language Pathology, Mackay Medical College, Taiwan
7. Hsin-Min Wang (whm@iis.sinica.edu.tw), Institute of Information Science, Taipei, Taiwan
ContentType Conference Proceeding
EISBN 9881476828; 9789881476821
Genre orig-research
IsPeerReviewed false
IsScholarly false
SubjectTerms Feature extraction
Noise measurement
Signal to noise ratio
Speech
Speech enhancement
Training
Visualization
URI https://ieeexplore.ieee.org/document/7820732