Both Spark and HBase are widely used, but using them together simply and with high performance remains a hard problem. The Spark HBase Connector (SHC) provides feature-rich, efficient access to HBase through Spark SQL. It bridges the gap between the simple HBase key-value store and complex relational SQL queries, and enables users to perform complex data analytics on top of HBase using Spark.
SHC implements the standard Spark data source APIs and leverages the Spark Catalyst engine for query optimization. To achieve high performance, SHC constructs its RDD from scratch instead of using the standard HadoopRDD. The custom RDD lets SHC fully implement the critical optimizations: partition pruning, column pruning, predicate pushdown and data locality. The design is easy to maintain and strikes a good balance between performance and simplicity. SHC also integrates natively with Phoenix data types. With SHC, Spark can run batch jobs that read data from and write data to Phoenix tables, and Phoenix can read and write HBase tables created by SHC. For example, inside Spark, users can run a complex SQL query on top of an HBase table created by Phoenix, join it against a DataFrame that reads its data from a Hive table, or integrate with Spark Streaming to build a more complicated system.
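As a rough illustration of partition pruning with a locality hint, the following minimal Python sketch keeps only the HBase regions whose row-key range overlaps the range implied by a query filter. It is not SHC's implementation: the `Region` class, the `prune_regions` function, and the region and host names are all hypothetical, and row keys are modeled as plain strings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for an HBase region covering keys in [start, end)."""
    name: str
    start: Optional[str]  # None means "from the first key"
    end: Optional[str]    # None means "to the last key"
    host: str             # region server, usable as a data-locality hint


def prune_regions(regions: List[Region],
                  lo: Optional[str],
                  hi: Optional[str]) -> List[Region]:
    """Keep only regions whose key range overlaps the query range [lo, hi)."""
    kept = []
    for r in regions:
        # The region must start before the query's upper bound ...
        starts_before_hi = hi is None or r.start is None or r.start < hi
        # ... and end after the query's lower bound.
        ends_after_lo = lo is None or r.end is None or r.end > lo
        if starts_before_hi and ends_after_lo:
            kept.append(r)
    return kept


# Three made-up regions that together cover the whole key space.
regions = [
    Region("r1", None, "g", "host-a"),
    Region("r2", "g", "p", "host-b"),
    Region("r3", "p", None, "host-c"),
]

# A filter such as: WHERE key >= 'h' AND key < 'k'
selected = prune_regions(regions, "h", "k")
print([(r.name, r.host) for r in selected])  # [('r2', 'host-b')]
```

Only the surviving regions would need a scan, and each scan could be scheduled on the host that serves its region. Column pruning and predicate pushdown follow the same idea of narrowing the work before any data leaves HBase.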
This session will demonstrate how SHC works, how to use SHC in secure and non-secure clusters, how SHC works with multiple HBase clusters, and how Spark reads data from and writes data to Phoenix tables with SHC. It will also benefit people who use Spark with data sources other than HBase, as it offers ideas for supporting high-performance data source access at the Spark DataFrame level.
Session hashtag: #SFeco10
Mingjie Tang is an engineer at Hortonworks. He works on SparkSQL, Spark MLlib and Spark Streaming. He has broad research interests in database management systems, similarity query processing, data indexing, big data computation, data mining and machine learning. Mingjie completed his PhD in Computer Science at Purdue University.
Weiqing has been working in the Apache Hadoop ecosystem since 2015 and is a Spark/HBase/Ambari/Hadoop contributor. She is currently a software engineer on the Spark team at Hortonworks. Before that, she obtained a Master’s Degree in Computational Data Science from Carnegie Mellon University. From 2011 to 2013, she was a software engineer at Schlumberger, where she worked on a real-time acquisition system designed for field engineers to acquire and process various types of underground data.