A method to enhance Apache Spark performance based on data segmentation and configuration parameters settings

Bibliographic Details
Published in: Sučasnij stan naukovih doslìdženʹ ta tehnologìj v promislovostì (Online), no. 1 (27), pp. 128–139
Main Authors: Minukhin, Serhii; Koptilov, Nikita
Format: Journal Article
Language: English
Published: 02.07.2024
ISSN: 2522-9818
EISSN: 2524-2296
DOI: 10.30837/ITSSI.2024.27.128


Abstract When using modern big data processing tools, a persistent problem is how to raise framework performance through effective setting of the many configuration parameters. The object of the research is the computational processes of big data processing with high-performance frameworks. The subject is methods and approaches for effectively setting framework configuration parameters under the constraints of virtualization environments and local resources. The purpose of the study is to improve the performance of Apache Spark and Apache Hadoop in their deployment modes through a combined approach: preprocessing segmentation of the input data, plus setting of basic and additional configuration parameters that account for the limitations of the virtual environment and local resources. Achieving this goal involves the following tasks: create a synthesized WordCount test data set for applying input data segmentation methods; determine the set of general and specific Apache Spark and Apache Hadoop configuration parameters that most affect performance in the Spark Standalone and Hadoop YARN (FIFO) deployment modes; justify changes to the default values of the configuration parameters by setting the level of parallelism, the number of partitions of the input file according to the number of processor cores, and the number of tasks assigned to each core and the system executor; conduct experiments to substantiate the theoretical results and demonstrate their practical use. Methods. The research used statistical analysis; a method of generating test data of arbitrary volume based on defined segmentation characteristics; and a systematic approach to comprehensive evaluation and analysis of framework performance based on selected configuration parameters. Results.
Using the developed system of parameters for evaluating the performance of the studied frameworks, experiments were carried out that include: applying the input data segmentation method, which divides the input file into paragraphs (lines) for different ranges of the number of words per line and the number of letters per word; and setting the main and specific parameters, in particular partitioning and parallelism, taking into account the characteristics of the virtual environment and the local resource. Based on the results, the proposed methods were analyzed in detail, and recommendations are given for choosing optimal values of the data segmentation and configuration parameters. Conclusions. The experimental results indicate that the proposed methods of setting Spark and Hadoop configuration parameters increase processing performance by up to 25–30% on average for small files (0.5–1 GB) and by up to 10–20% on average for large files (1.5–2.5 GB). At the same time, the average execution time of a single task decreased by 10–15% for files of different sizes and with different numbers of words per line.
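
The two ideas in the abstract, segmented WordCount test data and parallelism tied to core count, can be illustrated with a minimal sketch. This is not the authors' generator or tuning procedure: the function names, the uniform random sampling, and the rule "partitions = cores × tasks per core" are assumptions made only for illustration. The only Spark identifier used is the standard `spark.default.parallelism` property.

```python
import random
import string


def generate_line(rng, words_range, letters_range):
    """Build one line whose word count and word lengths fall in the given ranges.

    Both ranges are inclusive (min, max) tuples. Uniform sampling is an
    assumption; the abstract does not state the distribution used.
    """
    n_words = rng.randint(*words_range)
    words = []
    for _ in range(n_words):
        n_letters = rng.randint(*letters_range)
        words.append(
            "".join(rng.choice(string.ascii_lowercase) for _ in range(n_letters))
        )
    return " ".join(words)


def generate_dataset(n_lines, words_range, letters_range, seed=0):
    """Generate a reproducible list of lines (one 'segment' per line)."""
    rng = random.Random(seed)
    return [generate_line(rng, words_range, letters_range) for _ in range(n_lines)]


def suggested_parallelism(cores, tasks_per_core):
    """Hypothetical rule of thumb: one partition per available task slot.

    Returns a mapping that could be passed to Spark as configuration;
    the formula itself is an assumption, not a result from the paper.
    """
    partitions = cores * tasks_per_core
    return {"spark.default.parallelism": str(partitions)}


# Example: 5 lines, 3 to 6 words per line, 2 to 8 letters per word.
lines = generate_dataset(n_lines=5, words_range=(3, 6), letters_range=(2, 8), seed=42)
# Example: 4 cores with 2 tasks per core.
settings = suggested_parallelism(cores=4, tasks_per_core=2)
```

In a real run, the generated lines would be written to a file of the target size and the resulting setting supplied to Spark (for example through `spark-submit --conf`); the values that actually perform best depend on the environment, as the abstract notes.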
Author ORCID iDs
Minukhin, Serhii: 0000-0002-9314-3750
Koptilov, Nikita: 0009-0009-2109-8717
License: CC BY-NC-SA 4.0 (http://creativecommons.org/licenses/by-nc-sa/4.0)