This talk gives application developers an overview of applications built with Spark and the lessons learned from building them. It will cover 3-4 applications. Two are outlined here:

Logs Analysis:
– Using Spark for rapid development of logs analysis.
– Utilizing various Databricks Cloud visualizations.
– Creating dashboards for sharing with others.
– Scheduling a batch job.
– Monitoring a job with Spark SQL to help indicate when the batch job's code needs updating.

Wikipedia Analysis:
– Using Spark's parallelism to download the Wikipedia dataset faster.
– Best practices for working with XML data.
– Working around unevenly sharded data.
– Exploring data interactively with Spark.

Possible ideas for other applications to cover:
– API data (Facebook, Twitter, GitHub).
– Indexing a dataset to Elasticsearch. (This would demonstrate tips on how to write a large amount of data to Elasticsearch at once.)
Vida is currently a Solutions Engineer at Databricks, where her job is making sure customers get the most out of Databricks Cloud. Previously, she worked on scaling Square's Reporting Analytics System. She first began working with distributed computing at Google, where she improved search rankings of mobile-specific web content and built and tuned language models for speech recognition using a year's worth of Google search queries. She's passionate about accelerating the adoption of Apache Spark to bring the combination of speed and scale in data processing to the mainstream.