State-of-the-art approaches for managing Big Data pipelines assume their anatomy is known by design and expressed through ad-hoc Domain-Specific Languages (DSLs), with insufficient knowledge of the dark data involved in the pipeline execution. Dark data is data that organizations acquire during regular business activities but do not use to derive insights or to support decision-making. Recent literature on Big Data processing agrees that a new breed of Big Data pipeline discovery (BDPD) solutions can mitigate this issue solely by analyzing the event log that keeps track of pipeline executions over time. Relying on well-established process mining techniques, BDPD can reveal fact-based insights into how data pipelines unfold and how they access dark data. However, to date, there is no standard format for specifying the concept of Big Data pipeline execution in an event log, which makes it challenging to apply process mining to the BDPD task. To address this issue, in this paper we formalize a universally applicable reference data model to conceptualize the core properties and attributes of a data pipeline execution. We provide an implementation of the model as an extension to the XES interchange standard for event logs, demonstrate its practical applicability in a use case involving a data pipeline for managing digital marketing campaigns, and evaluate its effectiveness in uncovering dark data manipulated during several pipeline executions.
Publication details
2023, International Conference on Business Process Management, Pages 38-54 (volume: 490 LNBIP)
A Reference Data Model to Specify Event Logs for Big Data Pipeline Discovery (04b Conference paper in volume)
Benvenuti D., Marrella A., Rossi J., Nikolov N., Roman D., Soylu A., Perales F.
ISBN: 978-3-031-41622-4; 978-3-031-41623-1