The 7th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics

May 21, 2018
Vancouver, British Columbia CANADA

In Conjunction with 32nd IEEE International Parallel & Distributed Processing Symposium
May 21-25, 2018
JW Marriott Parq Vancouver
Vancouver, British Columbia CANADA

Advance Program

Time | Title | Authors/Speaker
8:15-8:30am | Opening remarks |
8:30-9:30am | Invited Talk 1: Scaling Deep Learning Algorithms on Extreme Scale Architectures | Abhinav Vishnu, Principal Member of Technical Staff, AMD, USA
9:30-10:00am | Break |
10:00-10:30am | Near-Optimal Straggler Mitigation for Distributed Gradient Methods (ParLearning-01) | Songze Li, Seyed Mohammadreza Mousavi Kalan, A. Salman Avestimehr and Mahdi Soltanolkotabi
10:30-11:00am | Streaming Tiles: Flexible Implementation of Convolution Neural Networks Inference on Manycore Architectures (ParLearning-02) | Nesma Rezk, Madhura Purnaprajna and Zain Ul-Abdin
11:00am-12:00pm | Invited Talk 2: Model Parallelism Optimization with Deep Reinforcement Learning | Azalia Mirhoseini, Google Brain, USA
12:00-1:30pm | Lunch |
1:30-2:30pm | Invited Talk 3: Introduction to Snap Machine Learning | Thomas Parnell, IBM Research – Zurich, Switzerland
2:30-3:00pm | Parallel Huge Matrix Multiplication on a Cluster with GPGPU Accelerators (ParLearning-03) | Seungyo Ryu and Dongseung Kim
3:00-3:30pm | Break |
3:30-4:00pm | Invited Talk 4: Matrix Factorization on GPUs: A Tale of Two Algorithms | Wei Tan, Citadel LLC, USA
4:00-4:30pm | A Study of Clustering Techniques and Hierarchical Matrix Formats for Kernel Ridge Regression (ParLearning-04) | Elizaveta Rebrova, Gustavo Chávez, Yang Liu, Pieter Ghysels and Xiaoye Sherry Li
4:30-5:00pm | Panel Discussion | Azalia Mirhoseini, Thomas Parnell, Wei Tan

Invited Talk 1

Abhinav Vishnu, Principal member of technical staff, AMD, USA

Scaling Deep Learning Algorithms on Extreme Scale Architectures

Abstract: Deep Learning (DL) is ubiquitous, yet leveraging distributed-memory systems for DL algorithms is incredibly hard. In this talk, we will present approaches to bridge this critical gap, starting with scaling DL algorithms on large-scale systems such as leadership-class facilities (LCFs). Specifically, we will: 1) present our TensorFlow and Keras runtime extensions, which require negligible changes in user code to scale DL implementations; 2) present communication-reducing/avoiding techniques for scaling DL implementations; 3) present approaches to fault-tolerant DL implementations; and 4) present research on semi-automatic pruning of DNN topologies. We will provide pointers and discussion on the general availability of our research under the umbrella of the Machine Learning Toolkit for Extreme Scale (MaTEx) available at
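The communication-reducing techniques mentioned above typically build on collective operations such as all-reduce, which averages gradients across workers. As a rough illustration only (a toy in-process simulation of a ring all-reduce, not MaTEx's MPI-based implementation; worker count and gradient sizes are invented):

```python
# Toy in-process simulation of a ring all-reduce, the collective that
# distributed gradient averaging commonly relies on. Illustrative sketch
# only; a real runtime would use MPI or a similar communication layer.

def ring_allreduce(worker_grads):
    """worker_grads: one gradient list per worker.
    Returns the element-wise average as every worker would hold it."""
    n = len(worker_grads)
    size = len(worker_grads[0])
    assert size % n == 0, "toy version: gradient length must divide into n chunks"
    c = size // n
    buf = [list(g) for g in worker_grads]

    def chunk(i, k):  # copy of worker i's k-th chunk
        return buf[i][k * c:(k + 1) * c]

    # Reduce-scatter: after n-1 ring steps, worker i owns the full sum
    # of chunk (i + 1) % n.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, chunk(i, (i - s) % n)) for i in range(n)]
        for i, k, payload in sends:
            dst = (i + 1) % n
            for j, v in enumerate(payload):
                buf[dst][k * c + j] += v

    # All-gather: circulate each fully reduced chunk around the ring.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, chunk(i, (i + 1 - s) % n)) for i in range(n)]
        for i, k, payload in sends:
            buf[(i + 1) % n][k * c:(k + 1) * c] = payload

    return [[v / n for v in b] for b in buf]

grads = [[1.0] * 6, [2.0] * 6, [3.0] * 6]   # three workers, six parameters
result = ring_allreduce(grads)
print(result[0])  # every worker ends with the element-wise average
```

Each worker sends and receives only one chunk per step, so the per-worker traffic stays constant as the number of workers grows, which is why ring-style collectives scale well.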

Bio: Abhinav Vishnu is a Principal Member of Technical Staff at AMD Research. He focuses on designing extreme-scale Deep Learning algorithms that are capable of execution on supercomputers and cloud computing systems. His specific objectives are to design user-transparent distributed TensorFlow; novel communication-reducing/approximation techniques for DL algorithms; fault-tolerant Deep Learning/Machine Learning algorithms; and multi-dimensional deep neural networks, along with applications of these techniques in several domains. His research is publicly available as the Machine Learning Toolkit for Extreme Scale (MaTEx) at

Invited Talk 2

Azalia Mirhoseini, Google Brain, USA

Model Parallelism Optimization with Deep Reinforcement Learning

Abstract: The past few years have witnessed growth in the size and computational requirements of training and inference with neural networks. Currently, a common approach to addressing these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of how to place parts of a neural model on devices is often made by human experts based on simple heuristics and intuitions. In this talk, I will present some of our recent efforts on learning to optimize model parallelism for TensorFlow computational graphs. Key to our method is the use of deep reinforcement learning to predict which subsets of operations in a TensorFlow graph should run on which of the available devices. The execution time of the predicted placements is then used as the reward signal to optimize the parameters of the deep model. Our main result is that on important computer vision, language modeling, and neural machine translation tasks, our model finds non-trivial ways to parallelize the model that outperform hand-crafted heuristics and traditional algorithmic methods.
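The reward loop described above can be sketched in miniature: sample a placement from a policy, measure (here: simulate) its execution time, and use the negative runtime as the reward. This is a toy REINFORCE policy over an invented four-operation graph, not the system from the talk; all op costs, device names, and hyperparameters are made up.

```python
import math
import random

OPS = {"conv1": 4.0, "conv2": 4.0, "fc": 2.0, "softmax": 1.0}  # op -> simulated cost
DEVICES = ["gpu:0", "gpu:1"]
logits = {op: [0.0] * len(DEVICES) for op in OPS}  # per-op device preferences

def device_probs(op):
    exps = [math.exp(s) for s in logits[op]]
    total = sum(exps)
    return [e / total for e in exps]

def sample_placement():
    return {op: random.choices(range(len(DEVICES)), weights=device_probs(op))[0]
            for op in OPS}

def makespan(placement):
    # Simulated "execution time": the load of the busiest device.
    load = [0.0] * len(DEVICES)
    for op, d in placement.items():
        load[d] += OPS[op]
    return max(load)

def train(steps=2000, lr=0.1, seed=0):
    random.seed(seed)
    baseline = 0.0
    for step in range(steps):
        placement = sample_placement()
        reward = -makespan(placement)          # faster placement => higher reward
        baseline = reward if step == 0 else 0.9 * baseline + 0.1 * reward
        advantage = reward - baseline
        for op, d in placement.items():        # REINFORCE update on the logits
            probs = device_probs(op)
            for k in range(len(DEVICES)):
                indicator = 1.0 if k == d else 0.0
                logits[op][k] += lr * advantage * (indicator - probs[k])
    # Greedy read-out of the learned policy.
    return {op: max(range(len(DEVICES)), key=lambda k: logits[op][k]) for op in OPS}

placement = train()
print(makespan(placement))  # the best achievable split has makespan 6.0
```

The real problem is far harder (thousands of interdependent ops, communication costs, a learned sequence model as the policy), but the reward signal has the same shape: measured runtime fed back into the policy parameters.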

Bio: Azalia Mirhoseini is a Research Scientist at Google Brain, where she focuses on machine learning approaches to solving problems in computer systems and meta-learning. Before Google, she was a Ph.D. student in Electrical and Computer Engineering at Rice University. Her work has been published at several conference and journal venues, including ICML, ICLR, DAC, ICCAD, SIGMETRICS, IEEE TNNLS, and ACM TRETS. She has received a number of awards, including the Best Ph.D. Thesis Award at Rice, fellowships from IBM Research, Microsoft Research, and Schlumberger, and a Gold Medal in the National Math Olympiad in Iran.

Invited Talk 3

Thomas Parnell, IBM Research – Zurich, Switzerland

Introduction to Snap Machine Learning

Abstract: Generalized linear models, such as logistic regression and support vector machines, remain some of the most widely used techniques in machine learning. Their enduring popularity can be attributed to their desirable theoretical properties, effective training algorithms, and relative ease of interpretation. In this talk we will introduce Snap Machine Learning, a new library for fast training of such models that is designed to enable new real-time and large-scale applications. The library was designed from the ground up with performance in mind. It exploits parallelism at three different levels: across multiple machines in a network, across heterogeneous compute units within a machine (e.g., CPU and GPU), and within the massively parallel hardware of modern GPUs. In this talk we will review this architecture and give examples of how the library can be used via the various APIs that are provided (e.g., Python, Apache Spark, MPI). Finally, we will present benchmarking results using the publicly available Terabyte Click Logs dataset (from Criteo Labs) and show that Snap Machine Learning can train a logistic regression classifier in 1.53 minutes, 46x faster than any of the results previously reported on the same dataset.
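To make the first of those three levels concrete, here is a deliberately simplified, single-process sketch of data-parallel logistic regression training: each "worker" computes a partial gradient on its data shard, and the shards' gradients are summed, as a multi-machine scheme would do. This is not Snap Machine Learning code; the dataset, shard count, and hyperparameters are invented.

```python
import math
import random

random.seed(1)
data = []
for _ in range(200):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    y = 1.0 if x[0] + x[1] > 0 else 0.0   # linearly separable toy labels
    data.append((x, y))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shard_gradient(shard, w):
    # Partial gradient of the logistic loss over one worker's data shard.
    g = [0.0, 0.0]
    for x, y in shard:
        p = sigmoid(w[0] * x[0] + w[1] * x[1])
        for j in range(2):
            g[j] += (p - y) * x[j]
    return g

def train(num_workers=4, epochs=200, lr=0.5):
    w = [0.0, 0.0]
    shards = [data[i::num_workers] for i in range(num_workers)]
    for _ in range(epochs):
        # In a real deployment each call below would run on its own
        # machine or GPU; here the "workers" run sequentially for clarity.
        grads = [shard_gradient(s, w) for s in shards]
        for j in range(2):
            w[j] -= lr * sum(g[j] for g in grads) / len(data)
    return w

def accuracy(w):
    return sum(1 for x, y in data
               if (sigmoid(w[0] * x[0] + w[1] * x[1]) > 0.5) == (y == 1.0)) / len(data)

w = train()
print(accuracy(w))
```

The other two levels (splitting work across CPU and GPU within a node, and across the thousands of threads within a GPU) apply the same divide-and-combine idea at finer granularity.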

Bio: Thomas received his B.Sc. and Ph.D. degrees in mathematics from the University of Warwick, U.K., in 2006 and 2011, respectively. He joined Arithmatica, Warwick, U.K., in 2005, where he was involved in FPGA design and electronic design automation. In 2007, he co-founded Siglead Europe, a U.K.-limited subsidiary of Yokohama-based Siglead Inc., where he was involved in developing signal processing and error-correction algorithms for HDD, flash, and emerging storage technologies. In 2013, he joined IBM Research in Zurich, Switzerland, where he is actively involved in the research and development of machine learning, compression, and error-correction algorithms for IBM’s storage and AI products. His research interests include signal processing, information theory, machine learning, and recommender systems.

Invited Talk 4

Wei Tan, Senior Research Engineer, Citadel LLC

Matrix Factorization on GPUs: A Tale of Two Algorithms

Abstract: Matrix factorization (MF) is an approach to derive latent features from observations. It is at the heart of many algorithms, e.g., collaborative filtering, word embedding, and link prediction. Alternating Least Squares (ALS) and stochastic gradient descent (SGD) are the two popular methods for solving MF. SGD converges fast, while ALS is easy to parallelize and able to deal with non-sparse ratings. In this talk, I will introduce cuMF, a CUDA-based matrix factorization library that accelerates both ALS and SGD to solve very large-scale MF. cuMF uses a set of techniques to maximize performance on single and multiple GPUs. These techniques include smart access of sparse data leveraging the memory hierarchy, combining data parallelism with model parallelism, and approximate algorithms and storage. With only a single machine with up to four Nvidia GPU cards, cuMF can be 10 times as fast, and 100 times as cost-efficient, as state-of-the-art distributed CPU solutions. In this talk I will also share lessons learned in accelerating compute- and memory-intensive kernels on GPUs.
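For intuition, here is the SGD variant of MF in a plain, single-threaded form: factor a small ratings matrix R ≈ X · Yᵀ with rank-2 latent features by repeatedly correcting the factors on each observed rating. This is a toy sketch, not cuMF; the ratings, rank, and hyperparameters are invented. ALS would instead alternate between solving an exact least-squares problem for each user vector and each item vector.

```python
import random

# Toy single-threaded SGD matrix factorization; illustrative only.
random.seed(0)
R = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1,
     (2, 1): 1, (2, 2): 5, (3, 2): 4}            # observed (user, item) -> rating
n_users, n_items, rank = 4, 3, 2

X = [[random.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
Y = [[random.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]

def sgd_epoch(lr=0.05, reg=0.01):
    entries = list(R.items())
    random.shuffle(entries)       # SGD visits observed ratings in random order
    for (u, i), r in entries:
        err = r - sum(X[u][k] * Y[i][k] for k in range(rank))
        for k in range(rank):
            xu, yi = X[u][k], Y[i][k]
            X[u][k] += lr * (err * yi - reg * xu)   # gradient step with L2 shrinkage
            Y[i][k] += lr * (err * xu - reg * yi)

def rmse():
    se = sum((r - sum(X[u][k] * Y[i][k] for k in range(rank))) ** 2
             for (u, i), r in R.items())
    return (se / len(R)) ** 0.5

for _ in range(300):
    sgd_epoch()
print(rmse())
```

The trade-off mentioned in the abstract shows up even here: each SGD update touches one rating and is cheap but sequential in spirit, while ALS batches all of a user's (or item's) ratings into one dense solve, which maps naturally onto parallel hardware.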

Bio: Dr. Wei Tan is a Senior Research Engineer at Citadel LLC. Before joining Citadel he was a Research Staff Member at IBM T.J. Watson Research Center. Wei has a wide range of research interests in distributed computing, machine learning, and GPU computing. Specifically, he worked on GPU-accelerated platforms for large-scale machine learning. He developed cuMF, by far the fastest matrix factorization library on GPUs. His work has been incorporated into the IBM patent portfolio and software products such as Spark, BigInsights, and Cognos. He received the IEEE Peter Chen Big Data Young Researcher Award (2016), IBM Outstanding Technical Achievement Award (2017, 2016, 2014), Best Paper Award at IEEE SCC (2017, 2011) and ACM/IEEE CCGrid (2015), Best Student Paper Award at IEEE ICWS (2014), the Pacesetter Award from Argonne National Laboratory (2010), and the caBIG Teamwork Award from the National Institutes of Health (2008). He has held adjunct professor positions at Tsinghua University and Tianjin University. For more information, please visit

Call for Papers

Scaling up machine learning (ML), data mining (DM), and artificial intelligence (AI) reasoning algorithms to massive datasets is a major technical challenge in the era of "Big Data". The past ten years have seen the rise of multi-core and GPU-based computing. In parallel and distributed computing, several frameworks such as OpenMP, OpenCL, and Spark continue to facilitate scaling up ML/DM/AI algorithms using higher levels of abstraction. We invite novel works that advance the fields of ML/DM/AI through the development of scalable algorithms or computing frameworks. Ideal submissions should describe methods for scaling up X using Y on Z, where potential choices for X, Y, and Z are provided below.

Scaling up

  • Recommender systems
  • Optimization algorithms (gradient descent, Newton methods)
  • Deep learning
  • Sampling/sketching techniques
  • Clustering (agglomerative techniques, graph clustering, clustering heterogeneous data)
  • Classification (SVM and other classifiers)
  • SVD and other matrix computations
  • Probabilistic inference (Bayesian networks)
  • Logical reasoning
  • Graph algorithms/graph mining and knowledge graphs
  • Semi-supervised learning
  • Online/streaming learning
  • Generative adversarial networks


Using

  • Parallel architectures/frameworks (OpenMP, OpenCL, OpenACC, Intel TBB)
  • Distributed systems/frameworks (GraphLab, Hadoop, MPI, Spark)
  • Machine learning frameworks (TensorFlow, PyTorch, Theano, Caffe)


On

  • Clusters of conventional CPUs
  • Many-core CPU (e.g. Xeon Phi)
  • FPGA
  • Specialized ML accelerators (e.g. GPU and TPU)

Proceedings of the ParLearning workshop will be distributed at the conference and will be submitted for inclusion in the IEEE Xplore Digital Library after the conference.

PDF Flyer


Best Paper Award: The program committee will nominate a paper for the Best Paper award. In past years, the Best Paper award included a cash prize. Stay tuned for this year!

Travel awards: Students with accepted papers have a chance to apply for a travel award. Please find details on the IEEE IPDPS web page.

Important Dates

  • Paper submission: February 16, 2018 AoE
  • Notification: March 16, 2018 (extended from March 9)
  • Camera Ready: March 30, 2018 (extended from March 16)

Paper Guidelines

Submitted manuscripts should be up to 10 single-spaced double-column pages using a 10-point font on 8.5x11-inch pages (IEEE conference style), including figures, tables, and references. Format requirements are posted on the IEEE IPDPS web page.

All submissions must be uploaded electronically at


Organizing Committee

  • General co-chairs: Henri Bal (Vrije Universiteit, The Netherlands) and Arindam Pal (TCS Innovation Labs, India)
  • Technical Program co-chairs: Azalia Mirhoseini (Google Brain, USA) and Thomas Parnell (IBM Research – Zurich, Switzerland)
  • Publicity chair: Yanik Ngoko (Université Paris XIII, France)
  • Steering Committee: Sutanay Choudhury (Pacific Northwest National Laboratory, USA), Anand Panangadan (California State University, Fullerton, USA), and Yinglong Xia (Huawei Research America, USA)

Technical Program Committee

  • Vito Giovanni Castellana, Pacific Northwest National Laboratory, USA
  • Tanmoy Chakraborty, IIIT Delhi, India
  • Sutanay Choudhury, Pacific Northwest National Laboratory, USA
  • Erich Elsen, Google Brain, USA
  • Dinesh Garg, IIT Gandhinagar and IBM Research, India
  • Kripabandhu Ghosh, IIT Kanpur, India
  • Saptarshi Ghosh, IIT Kharagpur, India
  • Kazuaki Ishizaki, IBM Research - Tokyo, Japan
  • Debnath Mukherjee, TCS Research, India
  • Francesco Parisi, University of Calabria, Italy
  • Saurabh Paul, PayPal, USA
  • Jianting Zhang, City College of New York, USA

Past workshops