The Naive Bayes classifier provided by Apache Spark out of the box is quite limited. It was intended primarily for document classification. It does not automatically handle continuous features, strings, or null values, and it provides no visualization of the resulting model. In this session, you’ll learn about changes made to allow it to work on datasets with columns of any type. More importantly, you’ll see a demonstration of how the model can be visualized to gain insight and applied interactively to new data.
Hear how ESI Group bins continuous columns using the open source Minimum Description Length Principle (MDLP) library developed by Sergio Ramirez, based on a paper by Fayyad and Irani. The library uses Spark to perform entropy-based binning on many columns simultaneously with respect to a specific target. Once continuous columns have been binned and indexed, they can be processed by Spark’s Naive Bayes algorithm.
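To give a feel for what entropy-based binning with respect to a target means, here is a minimal single-machine sketch in plain Python of the recursive splitting idea from Fayyad and Irani. It is not the Spark MDLP library's code or API: the function names (`entropy`, `best_cut`, `mdlp_cuts`) are hypothetical, it handles one column held in memory, and it assumes no missing values.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(pairs):
    """Find the boundary that minimizes the weighted class entropy.

    `pairs` is a list of (value, label) sorted by value. Returns
    (weighted_entropy, cut_value, split_index), or None if no cut exists.
    """
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or weighted < best[0]:
            best = (weighted, (pairs[i - 1][0] + pairs[i][0]) / 2, i)
    return best


def mdlp_cuts(values, labels):
    """Return sorted cut points for one continuous column and its target."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    cuts = []

    def split(segment):
        found = best_cut(segment)
        if found is None:
            return
        weighted, cut, i = found
        ys = [y for _, y in segment]
        left, right = ys[:i], ys[i:]
        n = len(segment)
        ent = entropy(ys)
        gain = ent - weighted
        # MDL stopping rule: accept the cut only if the information gain
        # outweighs the cost of encoding the extra partition.
        k, k1, k2 = len(set(ys)), len(set(left)), len(set(right))
        delta = math.log2(3 ** k - 2) - (
            k * ent - k1 * entropy(left) - k2 * entropy(right)
        )
        if gain > (math.log2(n - 1) + delta) / n:
            cuts.append(cut)
            split(segment[:i])
            split(segment[i:])

    split(pairs)
    return sorted(cuts)


# A column whose low values belong to class "a" and high values to class "b".
example_cuts = mdlp_cuts(list(range(1, 21)), ["a"] * 10 + ["b"] * 10)
```

Each returned cut point becomes a bin boundary, so every continuous value can be replaced by the index of the bin it falls in before being passed to a classifier that expects categorical features. A column that carries no information about the target produces no cuts and ends up as a single bin.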
Session hashtag: #SFml4
Barry G Becker loves to create software to visualize information. He has authored over a dozen journal papers in scientific and information visualization, and is an inventor on five US patents. He worked initially as a researcher at Lawrence Livermore National Laboratory, and later as a software professional at various start-up ventures and established software companies, most recently at ESI Group. Barry was one of the original members of the MineSet team in the 1990s, and rejoined the team in 2014 to help develop a cloud version of the product using web technologies and Spark.