Audio-visual speech recognition using deep bottleneck features and high-performance lipreading
This paper develops an Audio-Visual Speech Recognition (AVSR) method, by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating effectiveness of voice activity detection in a visual modality. In our approach, many kinds of visual features are incorporated, subsequently converted into bottleneck features by deep learning technology. By using proposed features, we successfully achieved 73.66% lipreading accuracy in speaker-independent open condition, and about 90% AVSR accuracy on average in noisy environments. In addition, we extracted speech segments from visual features, resulting 77.80% lipreading accuracy. It is found VAD is useful in both audio and visual modalities, for better lipreading and AVSR.
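The deep bottleneck features mentioned in the abstract are obtained by training a deep network with a narrow middle layer and then using that layer's activations as compact features. The sketch below is a minimal numpy illustration of the data flow only, not the authors' network: the layer sizes (39-dim audio frames, 30-dim visual frames, 16-dim bottleneck) are illustrative assumptions, the weights are random, and training on a frame-classification target is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class BottleneckDNN:
    """Toy feed-forward network with a narrow bottleneck layer.

    In bottleneck-feature extraction, a DNN is trained on a frame
    classification task; afterwards the activations of the narrow
    middle layer serve as compact features. Training is omitted
    here -- weights are random, purely to show the data flow.
    """

    def __init__(self, sizes):
        # sizes, e.g. [input, hidden, bottleneck, hidden, output]
        self.weights = [rng.standard_normal((m, n)) * 0.1
                        for m, n in zip(sizes[:-1], sizes[1:])]
        self.bottleneck_idx = len(sizes) // 2  # narrow middle layer

    def extract(self, frames):
        """Return bottleneck-layer activations for each input frame."""
        h = frames
        for i, w in enumerate(self.weights, start=1):
            h = relu(h @ w)
            if i == self.bottleneck_idx:
                return h
        return h

# Hypothetical dimensions: 39-dim audio frames (e.g. MFCC + deltas)
# and 30-dim visual frames, both mapped to 16-dim bottleneck features.
audio_net = BottleneckDNN([39, 256, 16, 256, 40])
visual_net = BottleneckDNN([30, 256, 16, 256, 40])

audio = rng.standard_normal((100, 39))   # 100 audio frames
visual = rng.standard_normal((100, 30))  # 100 visual frames

# Concatenate per-frame audio and visual bottleneck features for AVSR.
av_features = np.concatenate(
    [audio_net.extract(audio), visual_net.extract(visual)], axis=1)
print(av_features.shape)  # (100, 32)
```

In practice the concatenated audio-visual feature stream would feed a speech recognizer (the paper uses HMM-based recognition per its subject terms); the point of the bottleneck is dimensionality reduction learned discriminatively rather than by PCA alone.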
| Published in | 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 575-582 |
|---|---|
| Main Authors | Tamura, Satoshi; Ninomiya, Hiroshi; Kitaoka, Norihide; Osuga, Shin; Iribe, Yurie; Takeda, Kazuya; Hayamizu, Satoru |
| Format | Conference Proceeding |
| Language | English; Japanese |
| Published | Asia-Pacific Signal and Information Processing Association, 01.12.2015 |
| Subjects | Discrete cosine transforms; Feature extraction; Hidden Markov models; Mouth; Principal component analysis; Speech recognition; Visualization |
| DOI | 10.1109/APSIPA.2015.7415335 |
| Abstract | This paper develops an Audio-Visual Speech Recognition (AVSR) method, by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating effectiveness of voice activity detection in a visual modality. In our approach, many kinds of visual features are incorporated, subsequently converted into bottleneck features by deep learning technology. By using proposed features, we successfully achieved 73.66% lipreading accuracy in speaker-independent open condition, and about 90% AVSR accuracy on average in noisy environments. In addition, we extracted speech segments from visual features, resulting 77.80% lipreading accuracy. It is found VAD is useful in both audio and visual modalities, for better lipreading and AVSR. |
|---|---|
| Author | Tamura, Satoshi; Ninomiya, Hiroshi; Kitaoka, Norihide; Osuga, Shin; Iribe, Yurie; Takeda, Kazuya; Hayamizu, Satoru |
| Author_xml | 1. Satoshi Tamura, Gifu Univ., Gifu, Japan (tamura@info.gifu-u.ac.jp); 2. Hiroshi Ninomiya, Nagoya Univ., Nagoya, Japan (ninomiya.hiroshi@g.sp.m.is.nagoya-u.ac.jp); 3. Norihide Kitaoka, Tokushima Univ., Tokushima, Japan (kitaoka@is.tokushima-u.ac.jp); 4. Shin Osuga, Aisin Seiki Co., Ltd., Kariya, Japan (sohsuga@elec.aisin.co.jp); 5. Yurie Iribe, Aichi Prefectural Univ., Nagakute, Japan (iribe@ist.aichi-pu.ac.jp); 6. Kazuya Takeda, Nagoya Univ., Nagoya, Japan (takeda@is.nagoya-u.ac.jp); 7. Satoru Hayamizu, Gifu Univ., Gifu, Japan (hayamizu@gifu-u.ac.jp) |
| ContentType | Conference Proceeding |
| DOI | 10.1109/APSIPA.2015.7415335 |
| EISBN | 9881476801 9789881476807 |
| EndPage | 582 |
| ExternalDocumentID | 7415335 |
| Genre | orig-research |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English; Japanese |
| PageCount | 8 |
| PublicationCentury | 2000 |
| PublicationDate | 2015-12 |
| PublicationDateYYYYMMDD | 2015-12-01 |
| PublicationDecade | 2010 |
| PublicationTitle | 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) |
| PublicationTitleAbbrev | APSIPA |
| PublicationYear | 2015 |
| Publisher | Asia-Pacific Signal and Information Processing Association |
| StartPage | 575 |
| SubjectTerms | Discrete cosine transforms Feature extraction Hidden Markov models Mouth Principal component analysis Speech recognition Visualization |
| Title | Audio-visual speech recognition using deep bottleneck features and high-performance lipreading |
| URI | https://ieeexplore.ieee.org/document/7415335 |