Monday, July 10 • 1:30pm - 5:00pm
How to Accelerate Your Big Data Applications with Hadoop and Spark

Apache Hadoop and Spark are gaining prominence in handling Big Data and analytics. Recent studies have shown that default Hadoop and Spark cannot efficiently leverage the high-performance networking and storage architectures on modern HPC clusters, such as Remote Direct Memory Access (RDMA)-enabled high-performance interconnects and heterogeneous, high-speed storage systems (e.g., HDD, SSD, NVMe-SSD, and Lustre). These middleware systems are traditionally written with sockets and do not deliver the best performance on modern high-performance networks. In this tutorial, we will provide an in-depth overview of the architecture of Hadoop components (HDFS, MapReduce, etc.) and Spark. We will examine the challenges in re-designing the networking and I/O components of these middleware systems for modern interconnects and protocols with RDMA (such as InfiniBand and RoCE) and for modern storage architectures. Using the publicly available software packages from the High-Performance Big Data (HiBD, http://hibd.cse.ohio-state.edu) project, we will present case studies of the new designs for several Hadoop/Spark components and their associated benefits. Through these case studies, we will also examine the interplay between high-performance interconnects, high-speed storage systems, and multi-core platforms to achieve the best solutions for these components and for Big Data applications on modern HPC clusters. The tutorial will include hands-on sessions with Hadoop and Spark on the SDSC Comet supercomputer.

Monday July 10, 2017 1:30pm - 5:00pm CDT
Strand 11B