Baidu is one of the largest Internet companies in the world, with web traffic ranked 5th in the Alexa ranking. We were one of the early adopters of Spark, and we officially started running Spark in production with version 0.8.0. We have been steadily scaling up our clusters and adding more applications on top of the powerful Spark libraries. We have also contributed patches back to the community, and will make our best efforts to embrace Spark even further. In this presentation, we will share with the community some of the benefits we have experienced and some of the lessons we have learned in deploying Spark. We customized and optimized the deployment model to fit our own computing environments. As a result, we were able to scale our Spark cluster up to 1,000 nodes, with tens of thousands of jobs running on a daily basis. Some of our largest Spark jobs span thousands of cores and use more than 20 TB of RAM. By using our unified scheduler, Normandy, to manage Spark jobs, we successfully made Spark run on our Hadoop clusters. We have also used Spark SQL to power an interactive ad hoc query engine over Baidu's internal data warehouse, which helps many analysts quickly gain business insights by scanning hundreds of terabytes of data in seconds.
Hua Chai is an architect in Baidu's Infrastructure Department, in charge of distributed computing technology. Hua began his career working on AliYun's Apsara system, then joined Baidu in November 2010, where he built a real-time stream computing system from scratch that, according to public information, runs the largest cluster of its kind in China. Hua also developed the first distributed report engine in China, which is used by many important Baidu production systems. Recently, Hua has been working on the Baidu Unified Computing Platform to meet the company's Big Data demands.