Sqoop on Spark for Data Ingestion

Apache Sqoop has been used primarily to transfer data between relational databases and HDFS, leveraging the Hadoop MapReduce engine. Recently, the Sqoop community has made changes that allow data transfer between any two data sources represented in code by Sqoop connectors. For instance, it’s possible to use the latest Apache Sqoop to transfer data from MySQL to Kafka or vice versa via the JDBC connector and Kafka connector, respectively. This talk will focus on running Sqoop jobs on the Apache Spark engine and on proposed extensions to the APIs to use Spark functionality. We’ll discuss the design options explored and implemented to submit jobs to the Spark engine. We’ll demo one of the Sqoop job flows on Apache Spark and show how to use the Sqoop job APIs to monitor Sqoop jobs. The talk will conclude with use cases for Sqoop and Spark at Uber.

About Veena

Veena recently joined the data engineering team at Uber, focusing on stream processing solutions. She previously worked at LinkedIn and Cloudera on various parts of the stack, from front end to services, contributed to a couple of open source projects, and over the past year developed a keen interest in distributed systems such as Apache Kafka, Sqoop, and Apache Spark. As part of the ingest team at Cloudera, Veena focused on building solutions for batch and streaming ingestion and discovered the world of Apache Spark.

About Vinoth

Vinoth is currently an engineer on the big data platform team at Uber. Over the last six years, he has worked on various distributed data systems, including log-based replication, HPC, and stream processing. More recently, as the lead engineer for Voldemort (a key-value store), he helped build out data-as-a-service at LinkedIn. At Uber, he has been primarily focused on building a real-time pipeline for ingesting database change logs into Hadoop for batch and stream processing.