Beyond Collect and Parallelize for Tests

Slides PDF Video

As Spark jobs are used for more mission critical tasks, beyond exploration, it is important to have effective tools for testing. This talk expands on “Effective Testing For Spark Programs” (not required to have been seen) to discuss how to create large scale test jobs without depending on collect & parallelize which limit the sizes of datasets we can work with. Testing Spark Streaming jobs can be especially challenging, as the normal techniques for loading test data don’t work and additional work must be done to collect the results and stop streaming. We will explore the difficulties with testing Streaming Programs, options for setting up integration testing, beyond just local mode, with Spark, and also examine best practices for acceptance tests.

Holden Karau, Software Engineer at IBM

About Holden

Holden Karau is a software development engineer and is active in open source. She’s the co-author of “Learning Spark” and other Spark books and has taught Spark workshops. Prior to IBM, she worked on a variety of big data, search, and classification problems at Alpine, DataBricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelors of Mathematics in Computer Science.