Scalability and interactivity make Spark an excellent platform for data scientists who want to analyze very large datasets and build predictive models. However, the productivity of data scientists is hampered by lack of abstractions for building models for diverse types of data. For example, processing text or image data requires low level data coercion and transformation steps, which are not easy to compose into complex workflows for production applications. There is also a lack of domain specific libraries, for example for computer vision and image processing.
We present an open-source Spark library which simplifies common data science tasks such as feature construction and hyperparameter tuning, and allows data scientists to iterate and experiment on their models faster. The library integrates seamlessly with SparkML pipeline object model, and is installable through spark-packages.
The library brings deep learning and image processing to Spark through CNTK, OpenCV and Tensorflow in frictionless manner, thus enabling scenarios such training on GPU-enabled nodes, deep neural net featurization and transfer learning on large image datasets. We discuss the design and architecture of the library, and show examples of building a machine learning models for image classification.