Pivoting Data with SparkSQL


Pivot tables are an essential part of data analysis and reporting. A pivot can be thought of as translating rows into columns while applying one or more aggregations. Many popular data manipulation tools (pandas, reshape2, and Excel) and databases (MS SQL and Oracle 11g) include the ability to pivot data. With the release of Spark 1.6, pivot is now part of the DataFrame API. We discuss how to use it and go over real-world examples.
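To make "translating rows into columns while applying an aggregation" concrete, here is a minimal sketch in plain Python. It is not taken from the talk: the sample data, the column names (year, quarter, revenue), and the helper pivot_sum are all hypothetical, and the Spark call shown in the comment assumes a DataFrame named df with those same columns.

```python
from collections import defaultdict

# Hypothetical sample data: (year, quarter, revenue) rows.
rows = [
    (2015, "Q1", 100),
    (2015, "Q2", 150),
    (2015, "Q1", 50),
    (2016, "Q1", 200),
    (2016, "Q3", 75),
]


def pivot_sum(rows):
    """Group by year, turn distinct quarters into columns, and sum revenue."""
    # The distinct values of the pivot column become the output columns.
    columns = sorted({quarter for _, quarter, _ in rows})

    # Aggregate (sum) revenue for each (year, quarter) pair.
    totals = defaultdict(lambda: defaultdict(int))
    for year, quarter, revenue in rows:
        totals[year][quarter] += revenue

    # One output row per year; pairs with no input rows are left as None.
    table = {
        year: {col: totals[year].get(col) for col in columns}
        for year in sorted(totals)
    }
    return columns, table


columns, table = pivot_sum(rows)

# With Spark 1.6 or later, and assuming a DataFrame `df` with these columns,
# the corresponding DataFrame API call would look roughly like:
#   df.groupBy("year").pivot("quarter").sum("revenue")

for year, cells in table.items():
    print(year, [cells[col] for col in columns])
```

The two 2015 Q1 rows collapse into a single cell, which is the aggregation step, and each distinct quarter becomes its own column, which is the rows-to-columns step.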

Andrew Ray, Senior Data Engineer at Silicon Valley Data Science

About Andrew

Dr. Andrew Ray is a Senior Data Engineer at Silicon Valley Data Science. He’s passionate about big data and has extensive experience working with Spark. Andrew is an active contributor to the Apache Spark project, including SparkSQL and GraphX. Prior to joining SVDS, Andrew was a Data Scientist at Walmart, where he built an analytics platform on Hadoop that integrated data from multiple retail channels using fuzzy matching. He also led the adoption of Spark at Walmart, from proof of concept to production. Andrew earned his Ph.D. in Mathematics from the University of Nebraska, where he worked on extremal graph theory.