Parallelizing GPU-based Mini-Batch Graph Neural Network Training

Speaker:

Marco Serafini

Event date:

Wednesday, 3 July, 2024 - 11:30

Venue:

Aula Magna DIAG, Via Ariosto 25

Contact:

Leonardo Querzoni <querzoni@diag.uniroma1.it>

Abstract: Many datasets are best represented as graphs of entities connected by relationships rather than as a single uniform dataset or table. Graph Neural Networks (GNNs) have been used to achieve state-of-the-art performance in tasks such as classification and link prediction. This talk will discuss recent research on scalable GNN training.

The talk will focus on the popular mini-batch approach to GNN training. Each iteration has three steps: sampling the k-hop neighbors of the mini-batch, loading the samples onto the GPUs, and training. The first part of the talk will discuss NextDoor, which showed for the first time that GPU-based sampling can significantly speed up end-to-end GNN training. To maximize GPU utilization and speed up sampling, NextDoor proposes a new form of parallelism called transit parallelism. The second part of the talk will focus on split parallelism, a new approach that runs the entire mini-batch training pipeline on GPUs. It will present GSplit, a system that avoids redundant data loads by having all GPUs cooperatively perform sampling and training on the same mini-batch. Finally, the last part of the talk will discuss results from an experimental comparison between full-graph and mini-batch training systems.
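The three-step mini-batch iteration described above can be illustrated with a small single-process sketch. It is not NextDoor or GSplit. It runs on the CPU, assumes uniform per-hop fanout sampling (GraphSAGE-style), and replaces real training with a weight-free mean aggregation. The names `sample_k_hop`, `load_features`, `mean_aggregate`, and `train_iteration` are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple

Graph = Dict[int, List[int]]   # adjacency list: vertex -> neighbors
Edge = Tuple[int, int]         # (u, v): neighbor u sends a message to target v


def sample_k_hop(graph: Graph, seeds: Sequence[int], fanouts: Sequence[int],
                 rng: random.Random) -> List[List[Edge]]:
    """Step 1 (sampling): at hop i, keep up to fanouts[i] random neighbors per vertex."""
    layers: List[List[Edge]] = []
    frontier = list(seeds)
    for fanout in fanouts:
        edges: List[Edge] = []
        next_frontier: Set[int] = set()
        for v in frontier:
            nbrs = graph.get(v, [])
            picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            for u in picked:
                edges.append((u, v))
                next_frontier.add(u)
        layers.append(edges)
        frontier = sorted(next_frontier)  # sampled vertices become the next hop's targets
    return layers


def load_features(layers: List[List[Edge]], seeds: Sequence[int],
                  features: Dict[int, float]) -> Dict[int, float]:
    """Step 2 (loading): gather features of every vertex the sample touches.
    In a GPU system this subset is what gets copied to device memory."""
    needed: Set[int] = set(seeds)
    for edges in layers:
        for u, v in edges:
            needed.update((u, v))
    return {x: features[x] for x in needed}


def mean_aggregate(layers: List[List[Edge]], h: Dict[int, float]) -> Dict[int, float]:
    """Step 3 stand-in (training): weight-free mean aggregation, consuming the
    outermost hop first, the way a k-layer GNN consumes a k-hop sample."""
    for edges in reversed(layers):
        sums: Dict[int, float] = {}
        counts: Dict[int, int] = {}
        for u, v in edges:
            sums[v] = sums.get(v, 0.0) + h[u]
            counts[v] = counts.get(v, 0) + 1
        # Each target averages its own value with its sampled neighbors' values.
        h = {**h, **{v: (h[v] + s) / (counts[v] + 1) for v, s in sums.items()}}
    return h


def train_iteration(graph: Graph, features: Dict[int, float], seeds: Sequence[int],
                    fanouts: Sequence[int], rng: random.Random) -> Dict[int, float]:
    """One mini-batch iteration: sample, load, then 'train' (aggregate)."""
    layers = sample_k_hop(graph, seeds, fanouts, rng)
    batch_features = load_features(layers, seeds, features)
    h = mean_aggregate(layers, batch_features)
    return {s: h[s] for s in seeds}


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
    features = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}
    print(train_iteration(graph, features, seeds=[0], fanouts=[2, 2],
                          rng=random.Random(0)))  # {0: 1.5}
```

Keeping the three steps as separate functions mirrors the pipeline structure the talk targets. NextDoor's transit parallelism concerns how step 1 is parallelized on a GPU, while GSplit's split parallelism runs all three steps on GPUs cooperatively. Neither technique is modeled in this sequential sketch.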

Short Bio: Marco Serafini is an assistant professor at the Manning College of Information and Computer Sciences at UMass Amherst. He works on systems for graph learning, mining, and data management (e.g., the Arabesque, LiveGraph, NextDoor, and GSplit projects), cloud data management systems (e.g., Accordion, E-Store, and Clay), and big-data systems, including contributions to the Apache ZooKeeper and Storm projects. He has served on the Program Committees of major systems and database conferences, including SOSP, OSDI, EuroSys, SIGMOD, ASPLOS, VLDB, and ICDE, as Program Chair of the LADIS and APSys workshops, and as AE for SIGMOD.

Research group:

Cybersecurity

Keywords:

Artificial Intelligence and Robotics
Neural Networks and Support Vector Machines
Parallel and Distributed Computing Platforms