Audio-visual speech enhancement using deep neural networks


Bibliographic Details
Published in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1-6
Main Authors Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Jen-Chun Lin, Yu Tsao, Hsiu-Wen Chang, Hsin-Min Wang
Format Conference Proceeding
Language English
Published Asia-Pacific Signal and Information Processing Association, 01.12.2016
DOI 10.1109/APSIPA.2016.7820732


Abstract This paper proposes a novel framework that integrates audio and visual information for speech enhancement. Most speech enhancement approaches consider only audio features when designing the filters or transfer functions that convert noisy speech signals into clean ones. Visual data, which provide useful complementary information to audio data, have been integrated with audio data in many speech-related approaches to attain more effective speech processing. This paper investigates the use of visual features of lip motion as additional input to improve the performance of deep neural network (DNN) based speech enhancement. The experimental results show that a DNN with audio-visual inputs outperforms a DNN with audio-only inputs on four standardized objective evaluation metrics, confirming the effectiveness of including visual information in an audio-only speech enhancement framework.
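The fusion described in the abstract can be illustrated as early, feature-level fusion: the audio feature frame and the synchronized lip-motion feature frame are concatenated into a single input vector for a regression DNN that predicts the clean audio frame. The sketch below is an illustrative assumption, not the authors' exact architecture; all dimensions, layer sizes, and names (e.g. `enhance_frame`) are hypothetical, and the weights are random stand-ins for what would be trained on paired noisy/clean data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper):
# 257 spectral bins per audio frame, 30 lip-region visual features.
AUDIO_DIM, VISUAL_DIM, HIDDEN, OUT_DIM = 257, 30, 128, 257

# Stand-in weights; in practice these would be trained with a
# regression objective (e.g. MSE against clean-speech frames).
W1 = rng.standard_normal((AUDIO_DIM + VISUAL_DIM, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, OUT_DIM)) * 0.01
b2 = np.zeros(OUT_DIM)

def enhance_frame(audio_feat, visual_feat):
    """Map one noisy audio frame plus its synchronized visual frame
    to an enhanced audio frame via a single-hidden-layer DNN."""
    x = np.concatenate([audio_feat, visual_feat])  # feature-level fusion
    h = np.maximum(0.0, x @ W1 + b1)               # ReLU hidden layer
    return h @ W2 + b2                             # linear regression output

noisy_frame = rng.standard_normal(AUDIO_DIM)
lip_frame = rng.standard_normal(VISUAL_DIM)
enhanced = enhance_frame(noisy_frame, lip_frame)
print(enhanced.shape)  # (257,)
```

The audio-only baseline the paper compares against would simply omit `visual_feat` from the concatenation, keeping the rest of the pipeline unchanged.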
Author Details
1. Jen-Cheng Hou (coolkiu@citi.sinica.edu.tw), Research Center for Information Technology Innovation, Taipei, Taiwan
2. Syu-Siang Wang, Research Center for Information Technology Innovation, Taipei, Taiwan
3. Ying-Hui Lai (yhlai@ee.yzu.edu.tw), Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
4. Jen-Chun Lin, Institute of Information Science, Taipei, Taiwan
5. Yu Tsao, Research Center for Information Technology Innovation, Taipei, Taiwan
6. Hsiu-Wen Chang (hsiuwen@mmc.edu.tw), Department of Audiology & Speech Language Pathology, Mackay Medical College, Taiwan
7. Hsin-Min Wang (whm@iis.sinica.edu.tw), Institute of Information Science, Taipei, Taiwan
ContentType Conference Proceeding
EISBN 9881476828; 9789881476821
Genre orig-research
IsPeerReviewed false
IsScholarly false
SubjectTerms Feature extraction
Noise measurement
Signal to noise ratio
Speech
Speech enhancement
Training
Visualization
URI https://ieeexplore.ieee.org/document/7820732