Improved Mask-Based Neural Beamforming for Multichannel Speech Enhancement by Snapshot Matching Masking

Bibliographic Details
Published in Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 1-5
Main Authors Lee, Ching-Hua; Yang, Chouchang; Shen, Yilin; Jin, Hongxia
Format Conference Proceeding
Language English
Published IEEE 04.06.2023
ISSN 2379-190X
DOI 10.1109/ICASSP49357.2023.10096213

Abstract In multichannel speech enhancement (SE), time-frequency (T-F) mask-based neural beamforming algorithms use deep neural networks to predict T-F masks that represent speech and noise dominance. The predicted masks are then used to estimate the speech and noise power spectral density (PSD) matrices, from which the beamformer filter weights are computed based on signal statistics. However, most networks in the literature are trained to estimate pre-defined masks, e.g., the ideal binary mask (IBM) and the ideal ratio mask (IRM), which lack a direct connection to the PSD estimation. In this paper, we propose a new masking strategy that predicts the Snapshot Matching Mask (SMM), which aims to minimize the distance between the predicted and the true signal snapshots, thereby estimating the PSD matrices in a more systematic way. The performance of SMM is compared with existing IBM- and IRM-based PSD estimation for mask-based neural beamforming on several datasets to demonstrate its effectiveness for the SE task.
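
The generic pipeline the abstract describes (T-F masks, then PSD matrices, then beamformer weights) can be illustrated with a minimal NumPy sketch. This is not the paper's SMM method, whose details are not given in this record. It assumes the common mask-weighted covariance estimate and an MVDR beamformer in the reference-channel (Souden) form. The function names, array layout, and the `eps` regularizer are hypothetical choices made for illustration only.

```python
import numpy as np

def masked_psd(Y, mask, eps=1e-8):
    """Mask-weighted PSD (spatial covariance) estimate, one matrix per frequency.

    Y:    complex STFT of the mixture, shape (channels, frames, freqs)
    mask: real-valued T-F mask, shape (frames, freqs)
    Returns an array of shape (freqs, channels, channels).
    """
    # Sum over frames of mask[t, f] * y[t, f] y[t, f]^H for each frequency bin.
    num = np.einsum("tf,ctf,dtf->fcd", mask, Y, Y.conj())
    den = mask.sum(axis=0)[:, None, None] + eps
    return num / den

def mvdr_weights(phi_s, phi_n, ref=0, eps=1e-8):
    """MVDR weights w = (Phi_n^-1 Phi_s) u_ref / trace(Phi_n^-1 Phi_s)."""
    n_ch = phi_n.shape[-1]
    phi_n = phi_n + eps * np.eye(n_ch)      # light diagonal loading (assumption)
    G = np.linalg.solve(phi_n, phi_s)       # batched Phi_n^-1 Phi_s, shape (F, C, C)
    tr = np.trace(G, axis1=-2, axis2=-1)[:, None]
    return G[:, :, ref] / (tr + eps)        # shape (F, C)

def beamform(Y, w):
    """Apply the filter per frequency: out[t, f] = w[f]^H Y[:, t, f]."""
    return np.einsum("fc,ctf->tf", w.conj(), Y)
```

In this sketch a speech mask and a noise mask (for example `1 - speech_mask`) would each be passed to `masked_psd`, and the two resulting matrices to `mvdr_weights`. Any mask type, whether IBM, IRM, or a learned mask, plugs into the same two steps, so the quality of the PSD estimate depends entirely on the mask.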
Author_xml – sequence: 1
  givenname: Ching-Hua
  surname: Lee
  fullname: Lee, Ching-Hua
  organization: Samsung Research America
– sequence: 2
  givenname: Chouchang
  surname: Yang
  fullname: Yang, Chouchang
  organization: Samsung Research America
– sequence: 3
  givenname: Yilin
  surname: Shen
  fullname: Shen, Yilin
  organization: Samsung Research America
– sequence: 4
  givenname: Hongxia
  surname: Jin
  fullname: Jin, Hongxia
  organization: Samsung Research America
ContentType Conference Proceeding
DOI 10.1109/ICASSP49357.2023.10096213
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE/IET Electronic Library
IEEE Proceedings Order Plans (POP) 1998-present
Discipline Engineering
EISBN 1728163277
9781728163277
EISSN 2379-190X
EndPage 5
ExternalDocumentID 10096213
Genre orig-research
IsPeerReviewed false
IsScholarly true
Language English
PageCount 5
ParticipantIDs ieee_primary_10096213
PublicationCentury 2000
PublicationDate 2023-June-4
PublicationDateYYYYMMDD 2023-06-04
PublicationDate_xml – month: 06
  year: 2023
  text: 2023-June-4
  day: 04
PublicationDecade 2020
PublicationTitle Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998)
PublicationTitleAbbrev ICASSP
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
StartPage 1
SubjectTerms Array signal processing
Estimation
neural beamforming
Neural networks
power spectral density
Signal processing algorithms
Simulation
snapshot
Speech enhancement
Systematics
Time-frequency analysis
time-frequency mask
Title Improved Mask-Based Neural Beamforming for Multichannel Speech Enhancement by Snapshot Matching Masking
URI https://ieeexplore.ieee.org/document/10096213