BUILDING A SCALABLE DATASET FOR FRIDAY SERMONS OF AUDIO AND TEXT (SAT)
| Published in | Radìoelektronika, informatika, upravlìnnâ no. 2; p. 90 |
|---|---|
| Main Authors | Samah, A. A.; Dimah, H. A.; Hassanin, M. A. |
| Format | Journal Article |
| Language | English |
| Published | 27.06.2024 |
| ISSN | 1607-3274 2313-688X |
| DOI | 10.15588/1607-3274-2024-2-10 |
| Abstract | Context. Datasets are now collected and created in a growing range of sectors. Despite this widespread data production, gaps remain in specialized domains, particularly Islamic Friday Sermons (IFS). This domain is rich in theological, cultural, and linguistic material relevant to Arab and Muslim countries, and its value extends beyond religious discourse.
Objective. This research aims to fill that gap by introducing a comprehensive Sermon Audio and Text (SAT) dataset together with its metadata. The dataset is intended as an extensive resource for studies in religion, linguistics, and sociology, and as support for advances in Artificial Intelligence (AI) such as Natural Language Processing and Speech Recognition.
Method. The SAT dataset was developed in four phases: planning, creation and processing, measurement, and deployment. It comprises 21,253 audio and corresponding transcript files. Advanced audio processing techniques were applied to improve speech recognition and to make the dataset suitable for a wide range of uses.
Results. The fine-tuned SAT dataset achieved a Word Error Rate (WER) of 5.13%, a significant improvement in accuracy over the baseline Microsoft Azure Speech model. This result reflects the quality of the dataset and the effectiveness of the processing techniques employed. In addition, a novel Closest Matching Phrase (CMP) algorithm was developed to raise the confidence of the speech-to-text equivalents by adjusting lower-ratio phrases.
Conclusions. This research contributes resources and insights to studies in fields such as religion, linguistics, and sociology, and demonstrates the dataset's potential to support Artificial Intelligence (AI) applications. Future work will expand the dataset with a sign language video corpus, using advanced alignment techniques. This expansion will support ongoing Machine Translation (MT) development for a broader understanding of Islamic Friday Sermons across languages and cultures. |
|---|---|
| Author | Hassanin, M. A. Samah, A. A. Dimah, H. A. |
| Discipline | Engineering |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 2 |
| Language | English |
| License | https://creativecommons.org/licenses/by-sa/4.0 cc-by-sa |
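
The Results paragraph of the abstract reports a Word Error Rate (WER) of 5.13%. As a reminder of what that metric measures, the following is a minimal sketch of the standard word-level WER computation (word edit distance divided by the number of reference words). It is not the authors' evaluation code: the function name `word_error_rate`, the whitespace tokenization, and the absence of any text normalization are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative sketch: tokens are split on whitespace and compared exactly,
    with no normalization (an assumption, not the paper's procedure).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("the sermon begins now", "the sermon begin now"))  # 0.25
```

Under this definition, a WER of 5.13% corresponds to roughly five word-level errors per hundred reference words. Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.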