
java – Spark job takes a long time to read data from HDFS file using wholeTextFiles

Posted by: admin February 25, 2020

Questions:

I want to process data from HDFS files using Spark with Java. The processing applies simple transformations: replacing each newline with a space and finding patterns in the file content with a regex. I read the files with the wholeTextFiles method, but the job took 2 hours to process files of only 4 MB.
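The per-file step described above can be sketched in plain Java, without Spark, so that it can be timed on a single file in isolation. This is only an illustration, not the code from the job: the class name `FileContentSketch`, the method names, and the pattern `ID-\d+` are hypothetical stand-ins, since the actual regex is not shown. Both patterns are compiled once and reused.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileContentSketch {
    // Hypothetical pattern: a placeholder for the real regex, which is not shown in the question.
    private static final Pattern ID_PATTERN = Pattern.compile("ID-\\d+");
    // \R matches any line break; "\r\n" is treated as a single break.
    private static final Pattern LINE_BREAK = Pattern.compile("\\R");

    // Replace every line break in the file content with one space.
    static String flatten(String content) {
        return LINE_BREAK.matcher(content).replaceAll(" ");
    }

    // Flatten the content, then collect every match of the placeholder pattern.
    static List<String> findMatches(String content) {
        List<String> matches = new ArrayList<>();
        Matcher m = ID_PATTERN.matcher(flatten(content));
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String sample = "first ID-1\nsecond line\r\nthird ID-22";
        System.out.println(flatten(sample));
        System.out.println(findMatches(sample));
    }
}
```

In a Spark job, a function like `findMatches` could be applied to the values of the `JavaPairRDD<String, String>` returned by `wholeTextFiles`, for example through `mapValues`. Running it on one file outside Spark first may help show whether the time goes into the regex itself or into reading from HDFS.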

I tried increasing the Spark executor memory to 15g with 4 executor instances, but it still took 2 hours.
The cluster has 1 master with 56 GiB memory and 8 cores, and 3 workers with 28 GiB memory and 8 cores each.
How can I improve the performance of the Spark job with this node configuration?

Thanks,

Answers: