ASMBI: Special Issue on Data Science in Business and Industry

Applied Stochastic Models in Business and Industry has just published a Special Issue on Data Science in Business and Industry, guest edited by David Banks (david.banks@duke.edu), Alba Martínez-Ruiz (alba.martinez.ruiz@gmail.com), David F. Muñoz (davidm@itam.mx) and Javier Trejos-Zelaya (javier.trejos@ucr.ac.cr).

Statistical learning and stochastic modelling are the engines of data science. The models and methods help businesses leverage data to make better decisions and use resources more efficiently. Information and communications technologies have forever changed business planning and have led to the development of new industries. Companies use data science to manage supply chains, support dynamic pricing, optimize delivery, and control robotic manufacturing. Data science also opens the door to greener industries with less pollution and greater fuel efficiency.

This special issue collects high-quality contributions on a wide range of theoretical and applied topics in data science for business and industry. The call for papers sought a wide audience and included a special invitation to professors and researchers who presented an abstract at the International Conference on Data Science (ICDS2023), held from 8 to 10 November 2023 at the Universidad Diego Portales in Chile.

With 11 papers, this special issue includes applications of simulation-based and multi-objective optimization, kernel smoothing and random forests, hierarchical time series forecasting, Long Short-Term Memory networks, latent Dirichlet allocation, Bayesian linear models, mixtures of probability distributions, and support vector machines. The papers examine real-world applications across industries as varied as grocery stores, assembly plants, television advertising, internet activity, online markets, industrial recommendation systems, electricity consumption, and agriculture.

The first paper, by Vega, Musolesi, O’Sullivan, Prior, and Manolopoulou, applies latent Dirichlet allocation to big data on grocery store purchases. The paper finds regional “topics” in shopping behavior and then uses Gaussian process regression in a spatial analysis to show that stores near each other exhibit similar purchase patterns.
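
For readers who want the flavor of the pipeline, here is a minimal sketch of the two-stage idea: LDA on store-level purchase counts followed by a Gaussian process smooth of the topic weights over store locations. The data and variable names (basket_counts, store_coords) are illustrative, not from the paper.

```python
# Hypothetical two-stage sketch: topic modelling on store-level purchase
# counts, then a spatial smooth of the topic weights over store locations.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
basket_counts = rng.poisson(2.0, size=(50, 200))   # stores x products
store_coords = rng.uniform(0, 10, size=(50, 2))    # store locations

# Stage 1: LDA finds "topics" (purchase profiles) and per-store weights.
lda = LatentDirichletAllocation(n_components=5, random_state=0)
topic_weights = lda.fit_transform(basket_counts)   # stores x topics

# Stage 2: GP regression of one topic's weight on store location, so
# nearby stores are predicted to share similar shopping patterns.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(store_coords, topic_weights[:, 0])
print(gp.predict(store_coords[:3]))
```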

The assembly (maquiladora) industry is addressed in the article by Moncayo-Martínez, Naihui He, and Arias-Nava. The authors present a deterministic mathematical model, implemented with the Gurobi solver, to minimize the cycle time in a stochastic assembly line balancing problem. They combine simulation and simulation-based optimization to measure the impact of selected sources of uncertainty and to keep the average cycle time close to the deterministic one. The approach may be useful in manufacturing, transportation, logistics, scheduling, and project management, and the paper illustrates the methodology through a case study at a motorcycle manufacturer.
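
As a rough illustration of the simulation step (not the authors' actual model), the sketch below estimates the average cycle time when task times around a fixed deterministic assignment are stochastic; the task times and the task-to-station assignment are invented for the example.

```python
# Minimal sketch: given a task-to-workstation assignment from a
# deterministic solver, estimate the average cycle time when task
# times are stochastic. All numbers below are illustrative.
import numpy as np

rng = np.random.default_rng(1)
mean_times = np.array([4.0, 3.0, 5.0, 2.0, 6.0])   # deterministic task times
assignment = np.array([0, 0, 1, 1, 2])             # task -> station

def simulate_cycle_time(n_reps=10_000):
    # Lognormal noise around the deterministic task times.
    times = rng.lognormal(np.log(mean_times), 0.2, size=(n_reps, 5))
    # Station load = sum of its tasks; cycle time = busiest station.
    loads = np.stack([times[:, assignment == s].sum(axis=1)
                      for s in range(assignment.max() + 1)], axis=1)
    return loads.max(axis=1).mean()

print(f"estimated average cycle time: {simulate_cycle_time():.2f}")
```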

Veverka and Holý examine the relationship between television advertising and internet activity. If an ad is shown at 9:00 p.m., does it drive a spike in internet searches, and perhaps purchases, in the following hour? Their analysis uses kernel smoothing to account for time-of-day and seasonal effects, and random forests to measure the impact of ad position, TV channel, and other explanatory variables.
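
A toy sketch of the two ingredients, with simulated data throughout: a Gaussian-kernel baseline of activity over time of day, and a random forest relating the post-ad lift to ad characteristics.

```python
# Illustrative sketch: kernel-smoothed baseline of internet activity,
# then a random forest on ad features. Data and features are made up.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
t = np.arange(1440)                                # minutes in a day
activity = 100 + 30 * np.sin(2 * np.pi * t / 1440) + rng.normal(0, 5, t.size)

def kernel_smooth(y, bandwidth=30.0):
    # Nadaraya-Watson estimate with a Gaussian kernel over time of day.
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / bandwidth) ** 2)
    return (w @ y) / w.sum(axis=1)

baseline = kernel_smooth(activity)
lift = activity - baseline                         # deviation from baseline

# Random forest: lift in the hour after each ad vs. ad characteristics.
ads = rng.uniform(size=(200, 3))                   # position, channel, length
ad_lift = rng.normal(size=200)                     # observed lift per ad
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(ads, ad_lift)
print(rf.feature_importances_)
```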

Online markets also arise in the paper by Hirose, Daigaku, and Gakubu. They study Japanese resale regulations during the pandemic, when people were tempted to buy up and resell anhydrous ethanol, which was used for disinfection. People are complex, as are laws governing profiteering and hoarding, but the paper offers insights that should inform future market regulation.

The paper by Zito, Greaves, Soriano, and Richardson examines North Star metrics and online experimentation. It addresses critical issues such as the low sensitivity of these metrics and the differences between short- and long-term effects. Using multi-objective optimization, the authors propose a Pareto optimal proxy metrics method that simultaneously optimizes prediction accuracy and sensitivity. In experiments on a large industrial recommendation system, they find the proxy metrics to be eight times more sensitive than the North Star metrics.
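
The Pareto idea can be illustrated in a few lines. The sketch below (not the authors' algorithm) keeps the candidate proxy metrics that are not dominated on either objective, using simulated accuracy and sensitivity scores.

```python
# Toy sketch of Pareto optimality over candidate proxy metrics, scored
# on prediction accuracy and sensitivity. Scores here are simulated.
import numpy as np

rng = np.random.default_rng(3)
accuracy = rng.uniform(size=30)     # agreement with the North Star metric
sensitivity = rng.uniform(size=30)  # ability to detect small effects

def pareto_front(a, s):
    # A candidate is Pareto optimal if no other candidate is at least as
    # good on both objectives and strictly better on one of them.
    keep = []
    for i in range(len(a)):
        dominated = np.any((a >= a[i]) & (s >= s[i]) &
                           ((a > a[i]) | (s > s[i])))
        if not dominated:
            keep.append(i)
    return keep

print("Pareto-optimal proxy candidates:", pareto_front(accuracy, sensitivity))
```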

Hierarchical time series forecasting is examined by Cabreira, Silva, Cordeiro, Tolentino, Carbo-Bustinza, Rodrigues, and López-Gonzales. They study the important problem of electricity consumption in the Brazilian industrial sector, pairing bottom-up, top-down, and optimal combination reconciliation approaches with three predictive models: exponential smoothing, Box-Jenkins, and Long Short-Term Memory networks. The bottom-up approach combined with the neural network achieved the highest forecasting accuracy, enabling useful short-term projections of electricity consumption.
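
As a small illustration of the bottom-up approach, the sketch below forecasts three simulated sectoral series with exponential smoothing and sums the forecasts to obtain the aggregate; it is a stand-in for the idea, not the paper's pipeline.

```python
# Bottom-up reconciliation sketch (one of the three approaches compared),
# using exponential smoothing from statsmodels. The series are simulated
# stand-ins for sectoral electricity consumption.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(4)
months = 60
sectors = rng.gamma(5.0, 10.0, size=(3, months)).cumsum(axis=1)  # 3 series

# Bottom-up: forecast each bottom-level series, then sum the forecasts
# to obtain the forecast of the aggregate (total consumption).
h = 12
bottom_forecasts = [
    ExponentialSmoothing(s, trend="add").fit().forecast(h) for s in sectors
]
total_forecast = np.sum(bottom_forecasts, axis=0)
print(total_forecast[:3])
```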

The paper by Zhang, Kyo, Hachiya, and Noda models the shapes of Chinese yams so that automated cutting machines can carve them into seed pieces more efficiently. The primary tool is a remarkably sophisticated Bayesian linear model, and an empirical comparison shows the superior performance of the proposed technique.
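
For orientation, here is a generic conjugate Bayesian linear model in the same spirit (the paper's model is considerably more elaborate), with a simulated quadratic diameter profile standing in for yam shape data.

```python
# Generic conjugate Bayesian linear regression sketch: posterior over
# coefficients relating position along the yam to its diameter.
# All data and prior settings below are simulated/illustrative.
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 40)
X = np.column_stack([np.ones_like(x), x, x**2])    # quadratic shape profile
y = 3.0 + 2.0 * x - 4.0 * x**2 + rng.normal(0, 0.1, x.size)

tau2, sigma2 = 10.0, 0.01                          # prior and noise variances
# Posterior: N(m, V) with V = (X'X/sigma2 + I/tau2)^{-1}, m = V X'y / sigma2.
V = np.linalg.inv(X.T @ X / sigma2 + np.eye(3) / tau2)
m = V @ X.T @ y / sigma2
print("posterior mean coefficients:", m)
```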

The paper by Gabauer, Gupta, Karmakar, and Nielsen uses the Log-Periodic Power Law Singularity (LPPLS) model and Multi-Scale LPPLS Confidence Indicators (MS-LPPLS-CIs) to characterize positive and negative bubbles in G7 and BRICS stock markets at different time scales, with the aim of forecasting weekly gold returns. The authors find that the MS-LPPLS-CIs, particularly when positive and negative bubble indicators are considered simultaneously, can accurately forecast gold returns in the short to medium term.
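
For reference, a standard form of the LPPLS specification (the paper's notation may differ) models the expected log-price as a power law decorated with log-periodic oscillations as a critical time approaches:

```latex
% Standard LPPLS specification: t_c is the critical time, m the
% power-law exponent, \omega the log-periodic angular frequency,
% and C scales the oscillatory correction.
\mathbb{E}[\ln p(t)] = A + B\,(t_c - t)^{m}
                     + C\,(t_c - t)^{m}\cos\!\bigl(\omega \ln(t_c - t) - \phi\bigr)
```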

The article by Banks and Li discusses how statisticians must adapt to contribute to the new business models of the so-called knowledge economy, with particular emphasis on computational advertising, autonomous vehicles, large language models, and operations management. The authors find that many of our old tools remain relevant, even as the new problem space poses fresh research challenges for our economic and educational systems.

In the paper by Faddy, Yang, McClean, Donnelly, Khan, and Burke, the authors propose a new mixture model to represent task duration data and its heterogeneity. The mixture combines gamma, uniform, and exponential distributions, allowing both peaked and flat components and providing a general framework for constructing multimodal, skewed, and long-tailed probability distributions. To illustrate, the authors apply the new methodology to hospital billing data sets and automated test data from a telecom company.
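
The sketch below evaluates a density of this general form with illustrative weights and parameters (not the fitted values), showing how gamma, uniform, and exponential components combine into a multimodal, long-tailed distribution.

```python
# Sketch of a three-part mixture density in the spirit described above:
# gamma (peaked), uniform (flat), and exponential (long-tailed).
# Weights and parameters are illustrative, not the paper's estimates.
import numpy as np
from scipy import stats

weights = np.array([0.5, 0.2, 0.3])

def mixture_pdf(x):
    return (weights[0] * stats.gamma.pdf(x, a=4.0, scale=2.0)     # peaked
            + weights[1] * stats.uniform.pdf(x, loc=0, scale=30)  # flat
            + weights[2] * stats.expon.pdf(x, scale=15.0))        # long tail

x = np.linspace(0, 60, 5)
print(mixture_pdf(x))   # evaluates the multimodal, skewed density
```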

Finally, feature selection for predicting the direction of stock movements using sparse support vector machines is addressed by Miao, Wu, Cai, Fu, Zheng, and Wang. The authors propose a new sparse SVM framework based on recursive feature elimination combined with the ReliefF algorithm, together with a new filter algorithm that simultaneously captures individually relevant features and feature interactions.
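
A bare-bones sketch of the recursive feature elimination stage with a linear SVM appears below; the ReliefF scoring and the authors' new filter are omitted, and the data is random, standing in for stock movement predictors.

```python
# Sketch of recursive feature elimination (RFE) with a linear SVM.
# Features here are random stand-ins for stock movement predictors.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 20))                     # 20 candidate features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)      # up/down direction

# RFE repeatedly drops the features with the smallest SVM weights.
selector = RFE(LinearSVC(C=0.1, max_iter=5000), n_features_to_select=5)
selector.fit(X, y)
print("selected features:", np.flatnonzero(selector.support_))
```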