A method to enhance Apache Spark performance based on data segmentation and configuration parameters settings

Bibliographic Details
Published in	Sučasnij stan naukovih doslìdženʹ ta tehnologìj v promislovostì (Online) no. 1 (27); pp. 128 - 139
Main Authors	Minukhin, Serhii; Koptilov, Nikita
Format	Journal Article
Language	English
Published	02.07.2024
ISSN	2522-9818 2524-2296
DOI	10.30837/ITSSI.2024.27.128

Abstract	Modern big data processing frameworks raise the problem of improving performance through effective setting of their many configuration parameters. The object of the research is the computational processes of big data processing using high-performance frameworks. The subject is methods and approaches to the effective setting of framework configuration parameters under the limitations of virtualization environments and local resources. The purpose of the study is to improve the performance of Apache Spark and Apache Hadoop deployment modes through a combined approach that includes preprocessing segmentation of input data and the setting of basic and additional configuration parameters that take into account the limitations of the virtual environment and local resources. Achieving this goal involves the following tasks: create a synthesized WordCount test data set for applying input data segmentation methods; determine the general and specific Apache Spark and Apache Hadoop configuration parameters that most affect framework performance in the Spark Standalone and Hadoop YARN (FIFO) deployment modes; justify changes to the default values of the configuration parameters by setting the level of parallelism, the number of partitions of the input file according to the number of processor cores, and the number of tasks assigned to each core and the system executor; conduct experiments to substantiate the theoretical results and demonstrate their practical use. Methods. The research used the following methods: statistical analysis; a method of generating test data of arbitrary volume based on defined segmentation characteristics; a systematic approach to comprehensive evaluation and analysis of framework performance based on selected configuration parameters. Results. Based on the developed system of parameters for evaluating the performance of the studied frameworks, experiments were carried out that include: applying the method of input data segmentation, which divides the input file into paragraphs (lines) for different ranges of the number of words per line and the number of letters per word; and setting the main and specific parameters, in particular partitioning and parallelism, taking into account the characteristics of the virtual environment and the local resource. The results were analyzed in detail, with recommendations for choosing the optimal values of the data segmentation and configuration parameters. Conclusions. The experimental results show that the proposed methods of setting Spark and Hadoop configuration parameters increase processing performance: for small files (0.5–1 GB) by up to 25–30% on average, and for large files (1.5–2.5 GB) by up to 10–20% on average. At the same time, the average execution time of one task decreased by 10–15% for files of different sizes and with different numbers of words per line.
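The two ideas the abstract names, generating segmented WordCount test data and deriving partitioning and parallelism from the core count, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed random seed, and the rule "partitions = total cores × tasks per core" are assumptions made here for illustration. Only the property names `spark.default.parallelism`, `spark.executor.cores`, and `spark.cores.max` are taken from Spark's documented configuration.

```python
import random
import string


def generate_wordcount_lines(num_lines, words_per_line, letters_per_word, seed=0):
    """Yield synthetic text lines for a WordCount test.

    words_per_line and letters_per_word are inclusive (min, max) ranges.
    They stand in for the segmentation characteristics named in the
    abstract; the exact generator used in the paper is not known here.
    """
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    for _ in range(num_lines):
        n_words = rng.randint(*words_per_line)
        words = []
        for _ in range(n_words):
            n_letters = rng.randint(*letters_per_word)
            words.append("".join(rng.choices(string.ascii_lowercase, k=n_letters)))
        yield " ".join(words)


def partition_count(total_cores, tasks_per_core):
    """Hypothetical rule: one partition per task slot (cores x tasks per core)."""
    return total_cores * tasks_per_core


def suggest_spark_conf(total_cores, tasks_per_core, cores_per_executor):
    """Build Spark properties from the available cores.

    The values are an assumption of this sketch, not settings reported
    in the paper.
    """
    return {
        "spark.default.parallelism": str(partition_count(total_cores, tasks_per_core)),
        "spark.executor.cores": str(cores_per_executor),
        "spark.cores.max": str(total_cores),  # applies to Spark Standalone mode
    }
```

Under these assumptions, the dictionary could be passed to `spark-submit` as `--conf key=value` pairs, and the partition count could be supplied as the `minPartitions` argument of `SparkContext.textFile` when the generated file is read.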
Author	Koptilov, Nikita; Minukhin, Serhii
Author_xml	– sequence: 1 givenname: Serhii orcidid: 0000-0002-9314-3750 surname: Minukhin fullname: Minukhin, Serhii – sequence: 2 givenname: Nikita orcidid: 0009-0009-2109-8717 surname: Koptilov fullname: Koptilov, Nikita
ContentType	Journal Article
DBID	AAYXX CITATION ADTOC UNPAY
DOI	10.30837/ITSSI.2024.27.128
DatabaseName	CrossRef Unpaywall for CDI: Periodical Content Unpaywall
DatabaseTitle	CrossRef
DatabaseTitleList	CrossRef
Database_xml	– sequence: 1 dbid: UNPAY name: Unpaywall url: https://proxy.k.utb.cz/login?url=https://unpaywall.org/ sourceTypes: Open Access Repository
DeliveryMethod	fulltext_linktorsrc
Discipline	Business
EISSN	2524-2296
EndPage	139
ExternalDocumentID	10.30837/itssi.2024.27.128 10_30837_ITSSI_2024_27_128
GroupedDBID	AAYXX ADBBV ALMA_UNASSIGNED_HOLDINGS BCNDV CITATION GROUPED_DOAJ ADTOC UNPAY
IEDL.DBID	UNPAY
ISSN	2522-9818 2524-2296
IngestDate	Sun Oct 26 03:55:35 EDT 2025 Tue Jul 01 04:10:16 EDT 2025
IsDoiOpenAccess	false
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Issue	1 (27)
Language	English
License	http://creativecommons.org/licenses/by-nc-sa/4.0 cc-by-nc-sa
LinkModel	DirectLink
ORCID	0000-0002-9314-3750 0009-0009-2109-8717
OpenAccessLink	https://proxy.k.utb.cz/login?url=https://doi.org/10.30837/itssi.2024.27.128
PageCount	12
ParticipantIDs	unpaywall_primary_10_30837_itssi_2024_27_128 crossref_primary_10_30837_ITSSI_2024_27_128
ProviderPackageCode	CITATION AAYXX
PublicationCentury	2000
PublicationDate	2024-07-02
PublicationDateYYYYMMDD	2024-07-02
PublicationDate_xml	– month: 07 year: 2024 text: 2024-07-02 day: 02
PublicationDecade	2020
PublicationTitle	Sučasnij stan naukovih doslìdženʹ ta tehnologìj v promislovostì (Online)
PublicationYear	2024
SSID	ssj0002876021 ssib044762074 ssib036251356
Score	2.263289
SourceID	unpaywall crossref
SourceType	Open Access Repository Index Database
StartPage	128
Title	A method to enhance Apache Spark performance based on data segmentation and configuration parameters settings
URI	https://doi.org/10.30837/itssi.2024.27.128
UnpaywallVersion	publishedVersion
hasFullText	1
inHoldings	1
isFullTextHit
isPrint