Data scientists spend more time wrangling data than making models. Traditional tools like Pandas provide a very powerful data manipulation toolset. Transitioning to big data tools like PySpark allows one to work with much larger datasets, but can come at the cost of productivity.
In this session, learn about data wrangling in PySpark from the perspective of an experienced Pandas user. Topics will include best practices, common pitfalls, performance consideration and debugging.
Session hashtag: #SFds12
Dr. Andrew Ray is a Principal Data Engineer at Silicon Valley Data Science. He enjoys working at the intersection of engineering and data science. Andrew is an active contributor to the Apache Spark project. In his past life Andrew was a Data Scientist at Walmart, where he built an analytics platform on Hadoop that integrated data from multiple retail channels using fuzzy matching and graph algorithms. Andrew also led the adoption of Spark at Walmart from proof-of-concept to production. Andrew earned his Ph.D. in Mathematics from the University of Nebraska, where he worked on extremal graph theory.