San Francisco
June 30 - July 2, 2014


Spark Summit 2015e
Multi-modal big data analysis within the Spark ecosystem
Jordi Torres (Barcelona Supercomputing Center)

While the most part of big data systems target text-based analytics, multimedia data, which makes up about 2/3 of Internet traffic, provide unprecedented opportunities for understanding and responding to real world situations and challenges. However, performing a truly multi-modal analysis of data, involving for instance the visual content of photos uploaded to Instagram or Twitter, is not currently an easy task, unfeasible for the majority. On the one hand, the sophisticated algorithms for photo/video content analysis are still not part of the leader Big Data ecosystems such as Spark. On the other hand, processing these data imply a powerful and efficient hardware configuration. At the Barcelona Supercomputing Center (BSC) we are successfully performing multi-modal big data analytics with Apache Spark on an optimized HPC setup. As far as we know this is one of the first Spark HPC deployments, the first on a LSF-based supercomputer for sure. We have tuned Spark for the optimum usage of the MareNostrum supercomputer based in 3028 nodes, with a 500Gb local disk, 32 GB and 16 cores. In summary, 48,896 cores interconnected through a Infiniband (40Gb/s) and Ethernet, 94 TB of main memory. 2 PB of global GPFS storage and 1,5 PB distributed storage (in 3028 local disks with a bandwidth of 140MB/s). We are working on libraries for easing the execution of multi-modal big data analysis within the Spark ecosystem. One of the functionalities that these libraries will provide will be the possibility to import image files and metadata from multiple sources (e.g. Instagram and Twitter) in an efficient and unified way into Spark. We are working on a unified packaging format and metadata description that allows overcoming the “small files problem” and the heterogeneity of metadata formats. Another set of functionalities will apply visual recognition to extend the original metadata (e.g. geolocation, timestamp, title, etc.) with semantic information such as the description of features, relationships, actions and emotional information of people appearing in the content, as well as description of events, locations and objects. We are working on an optimized pipeline and a high-level API to enable executing these operations efficiently and without previous knowledge in computer vision. Tools for efficiently indexing and search the visual contents of the images are also planned. The libraries will also facilitate a more low-level processing, for instance providing functions for extracting Spark-compliant feature vectors from the content. Our extraordinary infrastructure allows us to test these libraries with amazing workloads, such as huge datasets of Instagram images. We have already performed tests with 200 nodes, i.e. 3200 cores and 6.4 TB RAM. We are realizing extensive resource usage analysis, performance and scalability evaluation tests over many different configurations such as, different problem size, processors affinity, different interconnection network, etc.

Jordi Torres is a professor at UPC Barcelona Tech. He has more than twenty five years of experience in R&D of advanced distributed and parallel systems in the High Performance Computing Group. His principal interest as a researcher is Processing and Analyzing Big Data in a Sustainable Cloud. He has about 150 research publications in journals, conferences and book chapters. In 2005 the Barcelona Supercomputing Center was founded and he was nominated as a Manager for Autonomic Systems and eBusiness Platforms research line. He acts as an expert on these topics for various organizations, companies and mentoring entrepreneurs