BUILDING A SCALABLE DATASET FOR FRIDAY SERMONS OF AUDIO AND TEXT (SAT)

Bibliographic Details
Published in	Radìoelektronika, informatika, upravlìnnâ no. 2; p. 90
Main Authors	Samah, A. A., Dimah, H. A., Hassanin, M. A.
Format	Journal Article
Language	English
Published	27.06.2024
ISSN	1607-3274 2313-688X
DOI	10.15588/1607-3274-2024-2-10

Abstract	Context. Datasets are now collected and created across many sectors, yet a gap remains in specialized domains, particularly Islamic Friday Sermons (IFS). This domain is rich in theological, cultural, and linguistic material relevant to Arab and Muslim countries, and its value extends beyond religious discourse. Objective. This research aims to close that gap by introducing a comprehensive Sermon Audio and Text (SAT) dataset with its metadata. The dataset is intended as an extensive resource for studies in religion, linguistics, and sociology, and to support advances in Artificial Intelligence (AI), such as Natural Language Processing and Speech Recognition. Method. The SAT dataset was developed in four phases: planning, creation and processing, measurement, and deployment. It contains a collection of 21,253 audio and corresponding transcript files. Advanced audio processing techniques were used to enhance speech recognition and make the dataset suitable for a wide range of uses. Results. Fine-tuning on the SAT dataset yielded a 5.13% Word Error Rate (WER), a significant improvement in accuracy over the baseline Microsoft Azure Speech model. This result indicates the quality of the dataset and the effectiveness of the processing techniques employed. In addition, a novel Closest Matching Phrase (CMP) algorithm was developed to raise the confidence of the speech-to-text equivalents by adjusting phrases with lower matching ratios. Conclusions. This research contributes insights and resources to studies in religion, linguistics, and sociology, and demonstrates the dataset's potential to support AI applications.
In future research, we will focus on expanding this dataset by adding a sign language video corpus, using advanced alignment techniques. This expansion is intended to support ongoing Machine Translation (MT) development and a broader understanding of Islamic Friday Sermons across languages and cultures.
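For readers unfamiliar with the metric reported in the abstract, Word Error Rate is conventionally defined as WER = (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, and N is the number of words in the reference. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; this record does not describe how the authors normalized text (for example, Arabic diacritics or punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Tokenization is plain whitespace splitting (an assumption; real
    evaluations usually normalize the text first).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a reported WER of 5.13% corresponds to roughly five word-level errors per hundred reference words.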
CitedBy_id	crossref_primary_10_1016_j_jksuci_2024_102165
ContentType	Journal Article
Discipline	Engineering
EISSN	2313-688X
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Issue	2
Language	English
License	https://creativecommons.org/licenses/by-sa/4.0 cc-by-sa