SESSION

Fully Automated QA System For Large Scale Search And Recommendation Engines Using Spark


“If you can not measure it, you can not improve it”. The relevancy of the documents retrieved by search and recommendation engines is crucial to end users. Exposing end users to irrelevant documents is costly because those users will turn away; therefore, companies that rely on search services strive to improve their search algorithms. Whenever an existing algorithm is tweaked or a new algorithm is implemented, an assessment is required. Most existing techniques rely on running an A/B test: a portion of the end users is exposed to the new search algorithm, and the click-through rate (CTR) of the existing algorithm is compared with that of the new one to measure the quality of each. In this talk we introduce a fully automated QA system for search and recommendation engines, which leverages implicit user feedback. The proposed system has been used successfully to assess CareerBuilder’s search engine. CareerBuilder operates the largest job board in the U.S. and has an extensive and growing global presence, with millions of job postings, more than 60 million actively-searchable resumes, over one billion searchable documents, and more than a million searches per hour. We implemented this system using Apache Spark. Spark enables us to derive implicit user feedback from about 19M search logs, then calculate the normalized discounted cumulative gain (NDCG) for different algorithms in less than 2 hours. We can report the estimated impact of a proposed change in a few hours instead of running an A/B test and waiting days to learn the impact. Given the size of the search logs that we collect every day, running this system in reasonable time requires a powerful distributed platform. We found Apache Spark to be the best platform to fulfill our needs.
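For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG can be computed for a ranked result list and averaged across queries. It uses the common textbook definition (gain divided by log2 of rank + 1) and is not taken from the CareerBuilder system. The function names, the relevance grades, and the idea that those grades were already derived from click logs are all assumptions made for illustration. The ideal ordering is computed from the same result list, which is a simplification.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG of the list divided by the DCG of its ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    if ideal == 0:
        return 0.0  # no relevant document in the list
    return dcg(relevances) / ideal


def mean_ndcg(queries: Sequence[Sequence[float]]) -> float:
    """Average NDCG over a collection of queries."""
    return sum(ndcg(q) for q in queries) / len(queries)


# Hypothetical relevance grades per query, listed in the order each
# algorithm ranked the documents (assumed to come from implicit feedback).
current_algorithm = [[3, 2, 0, 1], [0, 0, 1]]
candidate_algorithm = [[3, 2, 1, 0], [1, 0, 0]]

print("current  :", round(mean_ndcg(current_algorithm), 4))
print("candidate:", round(mean_ndcg(candidate_algorithm), 4))
```

In a Spark job the per-query `ndcg` call would typically run inside a distributed map over the search logs, with the mean taken as a final aggregation; that wiring is omitted here.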

Khalifeh AlJadda, Data Scientist at CareerBuilder

About Khalifeh

Khalifeh AlJadda holds a Ph.D. in computer science from the University of Georgia (UGA), with a specialization in machine learning. He has experience implementing large-scale, distributed machine learning algorithms to solve challenging problems in domains ranging from bioinformatics to search and recommendation engines. He is the lead data scientist on the search data science team at CareerBuilder, which is one of the largest job boards in the world. He leads the data science effort to design and implement the backend of CareerBuilder’s language-agnostic semantic search engine, leveraging Apache Spark and the Hadoop ecosystem.