Federated Learning (FL) enables collaborative training of Machine Learning (ML) models across
decentralized clients while preserving data privacy. One challenge FL faces is that
clients’ data is often not independent and identically distributed (non-IID). It is therefore crucial to quantify
how non-IID data affects performance. However, because few federated datasets are available,
it is difficult to carry out real-world simulations. In this work, we propose for the first time (1) the
Hist-Dirichlet-based and Min-Size-Dirichlet methods, which partition data across multiple nodes according to
feature and quantity distributions using the Dirichlet distribution. We (2) use the Jensen-Shannon and
Hellinger distances to quantify how far client data departs from IID. Moreover, we (3) implement state-of-the-art
partitioning methods based on the distribution of labels across clients. All our proposals are open-source in a
library called FedArtML, publicly available on PyPI. It facilitates research on cross-silo and cross-device
FL, allowing a systematic and controlled partition of centralized datasets according to label, feature, and
quantity skew. To demonstrate the value of our proposed methods and the robustness of FedArtML,
we ran experiments on ECG arrhythmia detection using PhysioNet 2020 data. Our results demonstrate
that our tool generates federated datasets for multi-client model training and accurately measures client
distribution heterogeneity. Our approach achieves 48% higher non-IID-ness than existing feature-skew
methods, providing more granularity. Furthermore, we validate our simulated federated datasets against
real-world data and find only a 2% F1-score difference, which supports the method’s real-life applicability.
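
The ideas of Dirichlet-based partitioning and distance-based heterogeneity measurement can be illustrated with a minimal sketch. The code below is not the FedArtML API and does not implement the Hist-Dirichlet or Min-Size-Dirichlet methods; it assumes the widely used label-skew scheme in which each class is split across clients using shares drawn from a Dirichlet distribution, and it computes the Jensen-Shannon and Hellinger distances between two discrete distributions. All function names and parameters (`dirichlet_label_partition`, `alpha`, and so on) are hypothetical and chosen for illustration only.

```python
import numpy as np


def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients, class by class, using Dirichlet shares.

    Smaller alpha gives more skewed (more non-IID) label distributions.
    Illustrative label-skew scheme only; not the FedArtML implementation.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut them by Dirichlet shares.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices


def label_distribution(labels, indices, num_classes):
    """Empirical label distribution of one client (uniform if the client is empty)."""
    if len(indices) == 0:
        return np.full(num_classes, 1.0 / num_classes)
    counts = np.bincount(np.asarray(labels)[indices], minlength=num_classes)
    return counts / counts.sum()


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 divergence), in [0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))


if __name__ == "__main__":
    # Synthetic labels: 4 balanced classes, 1000 samples in total.
    labels = np.repeat(np.arange(4), 250)
    parts = dirichlet_label_partition(labels, num_clients=5, alpha=0.5, seed=1)
    global_dist = label_distribution(labels, np.arange(len(labels)), 4)
    for client, idx in enumerate(parts):
        local = label_distribution(labels, idx, 4)
        print(client, len(idx),
              round(jensen_shannon_distance(local, global_dist), 3),
              round(hellinger_distance(local, global_dist), 3))
```

In this sketch each client's label distribution is compared with the global one; a distance near 0 indicates a nearly IID client, while values closer to 1 indicate stronger skew. How the library aggregates such per-client values into a single score is not specified here.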
Publication details
2024, IEEE ACCESS, Pages -
FedArtML: A Tool to Facilitate the Generation of Non-IID Datasets in a Controlled Way to Support Federated Learning Research (01a Journal article)
Jimenez Daniel, Anagnostopoulos Aris, Chatzigiannakis Ioannis, Vitaletti Andrea
Research group: Computer Networks and Pervasive Systems
keywords