At Facebook, millions of Hive queries are executed on a daily basis, and the workload contributes to important analytics that drive product decisions and insights. Spark SQL in Apache Spark provides much of the same functionality as Hive query language (HQL) more efficiently, and Facebook is building a framework to migrate existing production Hive workload to Spark SQL with minimal user intervention.
Before Facebook began large-scale migration to SparkSQL, they worked on identifying the gap between HQL and SparkSQL. They built an offline syntax analysis tool that parses, analyzes, optimizes and generates physical plans on daily HQL workload. In this session, they’ll share their results. After finding their syntactic analysis encouraging, they built tooling for offline semantic analysis where they run HQL queries in their Spark shadow cluster and validate the outputs. Output validation is necessary since the runtime behavior in Spark SQL may be different from HQL. They have built a migration framework that supports HQL in both Hive and Spark execution engines, can shadow and validate HQL workloads in Spark, and makes it easy for users to convert their workloads.
Session hashtag: #SFdev8
Jie Xiong is a Software Engineer at Facebook, where she works in Ads Data Infra team, focusing large-scale data storage and processing that powering Facebook Ads. She obtained her PhD from University of Illinois, and is interested in High-Performance Computation and Large Scale Data Processing.
Zhan Zhang is a Software Engineer at Facebook, where he is on the Data Infra group, focusing on large scale distributed systems, especially Apache Spark in production. He obtained his PhD from University of Florida, and is interested in Distributed System and Large Scale machine learning. Zhan is an active contributor to several Apache projects, such as Apache Spark, Yarn, HBase, etc, and has presented his work in Hadoop Summit (Dublin 2016), and HBaseCon (San Francisco, 2016).