Modelling Task Durations Towards Automated, Big Data, Process Mining – lay abstract

The lay abstract featured today (for Modelling Task Durations Towards Automated, Big Data, Process Mining by Malcolm Faddy, Lingkai Yang, Sally McClean, Mark Donnelly, Kashaf Khan and Kevin Burke) is from Applied Stochastic Models in Business and Industry, with the full article (Open Access) now available to read here.

How to Cite

Faddy, M., Yang, L., McClean, S., Donnelly, M., Khan, K. and Burke, K. (2025), Modelling Task Durations Towards Automated, Big Data, Process Mining. Appl Stochastic Models Bus Ind, 41: e2933. https://doi.org/10.1002/asmb.2933

Lay Abstract

Business process data often comes in vast, complex streams that need to be summarised and explained effectively to unlock their full value. Automated analysis of such data can quickly generate useful summary statistics, but the diversity within the data calls for models that describe clusters in the data, facilitating the location of modes (or cluster centres) and regions of randomness. These features are built into the models, which can then be fitted to the data using readily available algorithms.
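As an illustration only (not the authors' code), the sketch below fits a two-component Gaussian mixture by the EM algorithm using scikit-learn, one readily available implementation; the simulated durations, component count, and library choice are all assumptions made for the example.

```python
# A minimal sketch, assuming a Gaussian mixture is an adequate cluster
# model for task durations and that scikit-learn's EM fitter is used.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated durations: two clusters of tasks with different typical lengths.
durations = np.concatenate([
    rng.normal(loc=5.0, scale=1.0, size=5000),
    rng.normal(loc=12.0, scale=2.0, size=5000),
]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(durations)
print("cluster centres:", gm.means_.ravel())            # roughly [5, 12]
print("cluster spreads:", np.sqrt(gm.covariances_.ravel()))
```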

With datasets or samples often containing tens or even hundreds of thousands of data points, a key challenge lies in determining whether additional modes are statistically justified. Standard methods of statistical inference rest on a number of assumptions, among them independence between different data values; even small correlations in large datasets undermine this assumption and the reliability of the resulting inferences. This phenomenon, known as the big data paradox, means that more data doesn't necessarily mean more information, with returns diminishing rapidly once sample sizes grow substantially beyond 1000 or so.
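A small numerical illustration of this effect (assuming equicorrelated observations, not a result from the paper): under a common correlation rho between observations, the effective sample size is n_eff = n / (1 + (n − 1)·rho), which plateaus near 1/rho however large n becomes.

```python
# Hypothetical illustration of the big data paradox: with a small
# equicorrelation rho between observations, the effective sample size
# n_eff = n / (1 + (n - 1) * rho) plateaus near 1 / rho, so collecting
# far more than a few thousand points adds little independent information.
rho = 0.001
for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    n_eff = n / (1 + (n - 1) * rho)
    print(f"n = {n:>9,}  ->  effective sample size ~ {n_eff:,.0f}")
# n = 1,000 already gives n_eff ~ 500; n = 1,000,000 gives only ~ 999.
```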

Information criteria, which compare the information in the data with that captured by the fitted model, provide a means of discriminating between models of increasing complexity. An enhancement of the Bayesian Information Criterion, which allows greater weight to be given to the sample size, is proposed. Simulations and examples using real data show that it addresses the big data paradox while retaining some statistical authenticity.
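The exact form of the enhanced criterion is given in the full article; purely as a hedged sketch, one way a criterion can give greater weight to the sample size is to strengthen the usual log(n) penalty of BIC, for example by raising it to a power gamma ≥ 1 (gamma = 1 recovering standard BIC). The function name, penalty form, and numbers below are illustrative assumptions, not the authors' formula.

```python
# A hedged sketch, not the authors' exact criterion: a generic
# sample-size-weighted variant of BIC in which the usual log(n) penalty
# is raised to a power gamma >= 1, so that on large samples extra
# mixture components must earn a bigger likelihood gain to be accepted.
import math

def weighted_bic(log_likelihood: float, k: int, n: int, gamma: float = 1.0) -> float:
    """-2 log L + k * (log n)^gamma; smaller is better. gamma = 1 is standard BIC."""
    return -2.0 * log_likelihood + k * math.log(n) ** gamma

# Penalty size alone for k = 5 parameters at n = 100,000:
n, k = 100_000, 5
print(weighted_bic(0.0, k, n, gamma=1.0))  # ~ 57.6 (standard BIC penalty)
print(weighted_bic(0.0, k, n, gamma=1.5))  # ~ 195.4 (heavier penalty)
```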

These considerations represent steps toward a fully automated analysis package for streamlined and accurate interpretation of business data. More generally, they raise concerns about the value of simple inferences drawn from very large datasets.


More Details