Experimental research of optimizing the Apache Spark tuning: RDD vs data frames

Minukhin S. V.; Novikov M.; Brynza N. O.; Sitnikov D. E.

Please use this identifier to cite or link to this item: http://repository.hneu.edu.ua/handle/123456789/23444

Title:	Experimental research of optimizing the Apache Spark tuning: RDD vs data frames
Authors:	Minukhin S. V.; Novikov M.; Brynza N. O.; Sitnikov D. E.
Keywords:	Apache Spark; resilient distributed dataset; Data Frames; HDFS; shuffling; level of parallelism; data processing; data set; application execution time
Publication date:	2020
Bibliographic description:	Minukhin S. Experimental research of optimizing the Apache Spark tuning: RDD vs Data Frames / S. Minukhin, M. Novikov, N. Brynza, D. Sitnikov // Proceedings of The Third International Workshop on Computer Modeling and Intelligent Systems (CMIS-2020), April 27-May 1. - Zaporizhzhia, 2020. - PP. 409-425.
Abstract:	This paper presents the results and analysis of experiments that assess how effectively changing Apache Spark tuning parameters from their standard values reduces application execution time. A test dataset structure has been developed using RDD and Data Frames; it allows text files larger than 4 GB, with the properties required for testing, to be generated in minimal time. A distinctive feature of the test data is that they often reflect basic properties of real-world problems. The investigation has two stages: in the first, RDD and Data Frames are compared under the standard Apache Spark settings; in the second, experiments with different input test dataset sizes assess the influence of the level of parallelism, the HDFS block size, and the spark.sql.shuffle.partitions parameter in Spark Data Frames. The results confirm that the spark.sql.shuffle.partitions value affects the execution performance of the test task; value ranges and trends for this parameter have been identified. The levels of parallelism with the greatest influence on execution time have also been determined. It has been shown that for certain input test file sizes the HDFS block size can be left at its default value. The results of the computational experiments are presented in tables and graphs. They confirm that the suggested changes to the Apache Spark settings are more effective than the standard settings for the tested file sizes.
URI (Uniform Resource Identifier):	http://repository.hneu.edu.ua/handle/123456789/23444
Appears in collections:	Статті (ІКТ) [Articles (ICT)]
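
The abstract names three kinds of settings: the level of parallelism, the HDFS block size, and spark.sql.shuffle.partitions. The sketch below is a hedged illustration of how such settings might be assembled for a run; it is not taken from the paper. The function names are hypothetical, the values 64 and 16 are placeholders rather than the authors' results, and the estimate of one input split per HDFS block is a simplifying assumption. The property names spark.sql.shuffle.partitions (default 200), spark.default.parallelism, and dfs.blocksize (default 128 MiB in Hadoop 2 and later) are real configuration keys.

```python
import math

MIB = 1024 * 1024
GIB = 1024 * MIB


def input_partitions(file_size_bytes, block_size_bytes=128 * MIB):
    """Estimate input splits for a file in HDFS.

    Assumption: one split per HDFS block, which is a simplification.
    """
    return max(1, math.ceil(file_size_bytes / block_size_bytes))


def spark_conf(shuffle_partitions=200, default_parallelism=None,
               block_size_bytes=None):
    """Build a dict of Spark properties (hypothetical helper).

    200 is Spark's default for spark.sql.shuffle.partitions.
    """
    conf = {"spark.sql.shuffle.partitions": str(shuffle_partitions)}
    if default_parallelism is not None:
        conf["spark.default.parallelism"] = str(default_parallelism)
    if block_size_bytes is not None:
        # The spark.hadoop. prefix forwards the property to Hadoop;
        # dfs.blocksize applies to files written with this setting.
        conf["spark.hadoop.dfs.blocksize"] = str(block_size_bytes)
    return conf


def to_submit_args(conf):
    """Turn the dict into spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # A 4 GiB text file with the default 128 MiB block size.
    print(input_partitions(4 * GIB))
    # Placeholder values, not results from the paper.
    print(" ".join(to_submit_args(spark_conf(64, default_parallelism=16))))
```

The resulting argument list could be appended to a spark-submit command line; which values actually reduce execution time depends on the input size and cluster, which is the question the paper investigates experimentally.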

Files in this item:

File	Description	Size	Format
paper31.pdf		503.34 kB	Adobe PDF

All items in the electronic resources archive are protected by copyright, with all rights reserved.