Recent workload trends indicate rapid growth in the deployment of machine learning, genomics and scientific workloads using Apache Spark. However, running these applications efficiently on cloud computing infrastructure like Amazon EC2 is challenging, and we find that choosing the right hardware configuration can significantly improve performance and reduce cost. The key to addressing this challenge is the ability to predict the performance of applications under various resource configurations, so that the optimal configuration can be chosen automatically. We present Ernest, a performance prediction framework for large-scale analytics. Ernest builds performance models based on the behavior of a job on small samples of data and then predicts its performance on larger datasets and cluster sizes. Our evaluation on Amazon EC2 using several workloads shows that Ernest achieves low prediction error with a training overhead of less than 5% for long-running jobs.
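The train-small, predict-large idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than something stated above: the feature set in `features`, the use of plain least squares as a stand-in fitting method, the helper names, and the synthetic timings generated from the made-up `TRUE_THETA`.

```python
import math

def features(scale, machines):
    # Hypothetical feature set: a fixed cost, the share of data per machine,
    # a log term for tree-style aggregation, and a per-machine overhead term.
    return [1.0, scale / machines, math.log(machines), float(machines)]

def solve(a, b):
    # Gaussian elimination with partial pivoting on a small dense system.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit(samples):
    # samples: list of (data_scale, machines, seconds) from small training runs.
    rows = [features(s, m) for s, m, _ in samples]
    ys = [t for _, _, t in samples]
    k = len(rows[0])
    # Ordinary least squares via the normal equations (X^T X) theta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(theta, scale, machines):
    return sum(w * f for w, f in zip(theta, features(scale, machines)))

# Synthetic "measurements": made-up coefficients, not real benchmark results.
TRUE_THETA = [5.0, 2000.0, 1.5, 0.05]
training = [
    (s, m, sum(w * f for w, f in zip(TRUE_THETA, features(s, m))))
    for s in (0.0125, 0.025, 0.05, 0.1)   # at most 10% of the input data
    for m in (1, 2, 4, 8)                 # small clusters only
]

theta = fit(training)
# Extrapolate to the full dataset on a much larger cluster.
predicted = predict(theta, scale=1.0, machines=64)
print(f"predicted running time on 64 machines: {predicted:.2f} s")
```

In this sketch the training runs touch at most a tenth of the data on at most eight machines, and the fitted model is then evaluated at a configuration that was never measured. With real, noisy measurements a constrained fit such as non-negative least squares may be a better choice than the unconstrained solve used here.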
Shivaram Venkataraman is a fifth-year PhD student at the University of California, Berkeley, where he works with Mike Franklin and Ion Stoica at the AMP Lab. He is a committer on the Apache Spark project, and his research interests are in designing systems for large-scale machine learning. Before coming to Berkeley, he completed his M.S. at the University of Illinois, Urbana-Champaign and worked as a Software Engineer at Google.