BUILDING A SCALABLE DATASET FOR FRIDAY SERMONS OF AUDIO AND TEXT (SAT)

Bibliographic Details
Published in Radìoelektronika, informatika, upravlìnnâ no. 2; p. 90
Main Authors Samah, A. A., Dimah, H. A., Hassanin, M. A.
Format Journal Article
Language English
Published 27.06.2024
ISSN 1607-3274
EISSN 2313-688X
DOI 10.15588/1607-3274-2024-2-10


Abstract

Context. Datasets are now collected and created across many sectors. Despite this widespread data production, a gap remains in specialized domains, particularly Islamic Friday Sermons (IFS). This domain is rich in theological, cultural, and linguistic content relevant to Arab and Muslim countries, not only religious discourse.

Objective. This research aims to fill that gap by introducing a comprehensive Sermon Audio and Text (SAT) dataset together with its metadata. The dataset is intended as an extensive resource for studies in religion, linguistics, and sociology, and as support for advances in Artificial Intelligence (AI), such as Natural Language Processing and Speech Recognition.

Method. The SAT dataset was developed in four phases: planning, creation and processing, measurement, and deployment. The resulting collection contains 21,253 audio and corresponding transcript files. Advanced audio processing techniques were used to improve speech recognition and to make the dataset suitable for a wide range of uses.

Results. Fine-tuning with the SAT dataset achieved a Word Error Rate (WER) of 5.13%, a significant improvement in accuracy over the baseline Microsoft Azure Speech model. This result indicates both the quality of the dataset and the effectiveness of the processing techniques employed. In addition, a novel Closest Matching Phrase (CMP) algorithm was developed to raise the confidence of speech-to-text equivalence by adjusting lower-ratio phrases.

Conclusions. This research provides insights and resources for studies in religion, linguistics, and sociology, and demonstrates the dataset's potential to support AI applications. Future research will focus on expanding the dataset with a sign language video corpus, using advanced alignment techniques. The expansion is intended to support ongoing Machine Translation (MT) development and a broader understanding of Islamic Friday Sermons across languages and cultures.
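
For readers unfamiliar with the metric reported in the abstract, WER is conventionally defined as the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and a recognized transcript, divided by the number of words in the reference. The sketch below illustrates that standard definition only. It is not the authors' evaluation code, the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization, which may differ from the preprocessing used for the SAT dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length.

    Assumption: words are separated by whitespace and compared exactly;
    no diacritic, punctuation, or case normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 5.13% corresponds to roughly five word-level errors per hundred reference words. Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.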
Discipline Engineering
Open Access Yes
Peer Reviewed Yes
License CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)