At Microsoft, data analysts and engineers have been using Cosmos, a data store for storing, transforming, and computing over very large datasets. Cosmos is reliable and well suited to scenarios in which users need scheduled jobs run over large datasets. It serves use cases where the analyst does not need an immediate response, such as offline queries.
In this talk, I will show how our team helped its analysts and engineers by introducing a new paradigm into the pipeline: interactive querying and analytics over big data with Apache Spark, in particular the Spark SQL library. Through a series of MVPs, we hosted Spark on Mesos running on Windows servers. We then connected the Spark cluster to Avocado, our generic querying tool, through the ODBC interface that the Spark Thrift server provides.