Apache Hive has become the de facto standard for SQL on big data in the Hadoop ecosystem. Thanks to Hive's open architecture and backend neutrality, its queries can run on MapReduce and Tez. Meanwhile, Apache Spark, an open-source cluster computing framework for data analytics, has gained significant momentum recently. Powering Hive with Spark, that is, introducing Spark as a new execution engine for Hive, has many benefits for both Spark users and Hive users. Hive on Spark (HIVE-7292) is probably the most watched project in Hive, with 130+ watchers. The effort has attracted developers from both communities, from around the globe, and from well-known companies such as Intel, IBM, Cloudera, and MapR. This presentation covers the motivation, design principles, and architecture of the approach, with an emphasis on the technical challenges posed to both Spark and Hive, such as YARN integration, resource scaling, and user session management, as well as the approaches we take and the tradeoffs we make to overcome them. The presentation concludes with a status update on the project, followed by a live demo.
Chao Sun is currently a Software Engineer at Cloudera, Inc. He has been working on the Hive on Spark project since joining the company in mid-2014. Prior to that, he was a PhD student in Computer Science at the University of Wisconsin-Milwaukee, focusing on type systems, mechanized proofs, and programming languages.
Marcelo Vanzin has spent more than a decade solving problems at every layer of the software stack, from kernel drivers to large distributed applications. Today, as part of Cloudera’s Spark team, he is focused on enhancing Spark’s position as the execution engine of choice for distributed applications.