Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

Bibliographic Details
Published in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 575-582
Main Authors Tamura, Satoshi, Ninomiya, Hiroshi, Kitaoka, Norihide, Osuga, Shin, Iribe, Yurie, Takeda, Kazuya, Hayamizu, Satoru
Format Conference Proceeding
Language English
Japanese
Published Asia-Pacific Signal and Information Processing Association 01.12.2015
Subjects
Online Access Get full text
DOI 10.1109/APSIPA.2015.7415335


Abstract This paper develops an Audio-Visual Speech Recognition (AVSR) method by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating the effectiveness of voice activity detection (VAD) in the visual modality. In our approach, many kinds of visual features are incorporated and subsequently converted into bottleneck features using deep learning. Using the proposed features, we achieved 73.66% lipreading accuracy in a speaker-independent open condition and about 90% AVSR accuracy on average in noisy environments. In addition, we extracted speech segments from visual features, resulting in 77.80% lipreading accuracy. VAD is found to be useful in both the audio and visual modalities, for better lipreading and AVSR.
Author Osuga, Shin
Ninomiya, Hiroshi
Iribe, Yurie
Kitaoka, Norihide
Takeda, Kazuya
Tamura, Satoshi
Hayamizu, Satoru
Author_xml – sequence: 1
  givenname: Satoshi
  surname: Tamura
  fullname: Tamura, Satoshi
  email: tamura@info.gifu-u.ac.jp
  organization: Gifu Univ., Gifu, Japan
– sequence: 2
  givenname: Hiroshi
  surname: Ninomiya
  fullname: Ninomiya, Hiroshi
  email: ninomiya.hiroshi@g.sp.m.is.nagoya-u.ac.jp
  organization: Nagoya Univ., Nagoya, Japan
– sequence: 3
  givenname: Norihide
  surname: Kitaoka
  fullname: Kitaoka, Norihide
  email: kitaoka@is.tokushima-u.ac.jp
  organization: Tokushima Univ., Tokushima, Japan
– sequence: 4
  givenname: Shin
  surname: Osuga
  fullname: Osuga, Shin
  email: sohsuga@elec.aisin.co.jp
  organization: Aisin Seiki Co., Ltd., Kariya, Japan
– sequence: 5
  givenname: Yurie
  surname: Iribe
  fullname: Iribe, Yurie
  email: iribe@ist.aichi-pu.ac.jp
  organization: Aichi Prefectural Univ., Nagakute, Japan
– sequence: 6
  givenname: Kazuya
  surname: Takeda
  fullname: Takeda, Kazuya
  email: takeda@is.nagoya-u.ac.jp
  organization: Nagoya Univ., Nagoya, Japan
– sequence: 7
  givenname: Satoru
  surname: Hayamizu
  fullname: Hayamizu, Satoru
  email: hayamizu@gifu-u.ac.jp
  organization: Gifu Univ., Gifu, Japan
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/APSIPA.2015.7415335
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Xplore Digital Library (LUT)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9881476801
9789881476807
EndPage 582
ExternalDocumentID 7415335
Genre orig-research
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
ID FETCH-LOGICAL-i156t-9c89cdb161f96b28717d7e73a17bf6792385097c5b9d720d12e2cc9e2193bdb33
IEDL.DBID RIE
IngestDate Thu Jun 29 18:36:27 EDT 2023
IsPeerReviewed false
IsScholarly false
Language English
Japanese
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i156t-9c89cdb161f96b28717d7e73a17bf6792385097c5b9d720d12e2cc9e2193bdb33
PageCount 8
ParticipantIDs ieee_primary_7415335
PublicationCentury 2000
PublicationDate 2015-12
PublicationDateYYYYMMDD 2015-12-01
PublicationDate_xml – month: 12
  year: 2015
  text: 2015-12
PublicationDecade 2010
PublicationTitle 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
PublicationTitleAbbrev APSIPA
PublicationYear 2015
Publisher Asia-Pacific Signal and Information Processing Association
Publisher_xml – name: Asia-Pacific Signal and Information Processing Association
Score 1.7786301
SourceID ieee
SourceType Publisher
StartPage 575
SubjectTerms Discrete cosine transforms
Feature extraction
Hidden Markov models
Mouth
Principal component analysis
Speech recognition
Visualization
Title Audio-visual speech recognition using deep bottleneck features and high-performance lipreading
URI https://ieeexplore.ieee.org/document/7415335
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE