In this presentation we would like to present a new machine learning algorithm, Random Ferns, and the way it went from an initial publication in a scholarly paper to a fully functional and reusable Spark implementation released via Spark Packages. First of all, we shall demonstrate the algorithm itself along with its initial implementation in R. We will present its strengths and weaknesses. It will be compared with the well-known Random Forest in terms of both correctness and training efficiency. Based on that, some advice will be given on when the algorithm should be used. After that, the implementation part will be covered. The process of porting an R package to a Spark library will be described. Special emphasis will be placed on the scalability issues that had to be resolved to enable the code to work effectively in a distributed environment. We will also present the datasets we used to evaluate our solution and the results we were able to achieve. Finally, the issues related to reusability and publication will be discussed, e.g. API design and submission to Spark Packages. We will also show how the library can be used in custom projects.
A Data Scientist with almost five years of experience in the Apache Hadoop ecosystem, mainly interested in large-scale research analytics. Develops code in Java, Apache Pig and Scala, occasionally in Python, focusing on applications of scalable machine learning techniques. Former lecturer at the Polish Academy of Sciences (“Web-Scale Data Mining and Processing” e-learning course).
A Data Scientist from the University of Warsaw who loves combining research with great engineering craftsmanship. Specialises in scalable text and data mining with a focus on entity matching. Before joining the university, got a taste of big data at Microsoft and True Knowledge (now an Amazon subsidiary). Currently uses Spark and R to predict how much milk a cow will give on a particular day based on historical records. Enjoys snowboarding, loves Latin phrases.