Spark Deployment and Performance Evaluation on a Petascale HPC Setup


Traditional HPC systems are designed according to the compute-centric paradigm, which focuses on computing power and aims to execute as many floating-point operations per second as possible. However, the growing importance of data-intensive applications is pushing many computing facilities toward a data-centric paradigm, in which the quantity to maximize is the amount of data, measured in records or bytes, processed per second during data analysis. The emerging focus on big data and the potential paradigm shift pose a dilemma for the managers of traditional HPC facilities, who have to choose between deploying dedicated systems for data analytics and evolving their existing infrastructure to meet the new demands. We have studied the second option, adapting an existing HPC setup to host a massively parallel dataflow platform able to execute big data workloads. Among the available massively parallel dataflow frameworks, we chose Apache Spark. We deployed Apache Spark 1.4.0 on a real-world, petascale HPC setup, the MareNostrum supercomputer, which is built on commodity hardware. We designed and developed a framework (Spark4MN) to run a Spark cluster efficiently over a Load Sharing Facility (LSF)-based environment and to account for the hardware particularities of MareNostrum, such as GPFS storage, an InfiniBand network, and multicore nodes. We evaluated the behavior of two representative data-intensive applications (sorting and k-means). Especially for k-means, we show that MareNostrum’s performance is scalable and similar to, if not better than, the top
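As a rough illustration of the data-centric metric described in the abstract, the sketch below runs a few k-means iterations in plain Python and reports throughput in records per second. It is not the Spark4MN code or the benchmark used in the talk: the function names (kmeans_step, records_per_second, run_demo), the synthetic two-cluster data, and the single-process execution are all assumptions made for the example. The assignment loop plays the role of the map phase that a dataflow framework such as Spark would distribute across nodes, and the per-cluster averaging plays the role of the reduce phase.

```python
import random
import time


def closest(point, centers):
    """Return the index of the center nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans_step(points, centers):
    """One k-means iteration: assign each point (map), then average per cluster (reduce)."""
    sums = [[0.0] * len(centers[0]) for _ in centers]
    counts = [0] * len(centers)
    for point in points:
        i = closest(point, centers)
        counts[i] += 1
        for d, value in enumerate(point):
            sums[i][d] += value
    # A cluster that received no points keeps its previous center.
    return [
        [s / counts[i] for s in sums[i]] if counts[i] else list(centers[i])
        for i in range(len(centers))
    ]


def records_per_second(num_records, elapsed_seconds):
    """Data-centric throughput: records processed per second."""
    return num_records / elapsed_seconds


def run_demo(num_points=10_000, iterations=5, seed=0):
    """Cluster two synthetic 2-D blobs and report (centers, throughput)."""
    rng = random.Random(seed)
    # Hypothetical data: half the points around (0, 0), half around (10, 10).
    points = [
        (rng.gauss(mean, 1.0), rng.gauss(mean, 1.0))
        for mean in (0.0, 10.0)
        for _ in range(num_points // 2)
    ]
    # Seed one center from each blob (first and last generated point).
    centers = [list(points[0]), list(points[-1])]
    start = time.perf_counter()
    for _ in range(iterations):
        centers = kmeans_step(points, centers)
    elapsed = time.perf_counter() - start
    # Every iteration touches every record once.
    return centers, records_per_second(len(points) * iterations, elapsed)


if __name__ == "__main__":
    final_centers, throughput = run_demo()
    print("centers:", final_centers)
    print("records/s:", round(throughput))
```

Measuring records per second rather than floating-point operations per second is what distinguishes the data-centric view. On a real cluster the same ratio would be computed over the full distributed input and the end-to-end job time.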


About Jordi

Jordi Torres is a professor at UPC and a research manager at BSC who explores the future of computing through high-level research in areas such as Big Data Analytics. He currently also holds a consultative and strategic role with a visionary task related to next-generation technology. He is both a creative thinker and an influential collaborator, having worn many hats throughout his long career. He acts as an expert for various organizations and companies and mentors entrepreneurs. He is also a writer who collaborates with Spanish mass media and maintains a blog