Author(s):
Tavares, Ana Helena ; Silva, Ana ; Freitas, Tiago ; Costa, Maria ; Macedo, Pedro ; Costa, Rui A. da
Date: 2025
Persistent ID: http://hdl.handle.net/10773/46060
Origin: RIA - Repositório Institucional da Universidade de Aveiro
Subject(s): Big data; Collinearity; Maximum entropy; Regression modelling
Description
Despite advances in data analysis methodologies over the last decades, most traditional regression methods cannot be applied directly to large-scale data. Although aggregation methods are specifically designed to handle large-scale data, their performance may deteriorate sharply in ill-conditioned problems (due to collinearity). This work compares the performance of a recent approach based on normalized entropy, a concept from information theory and info-metrics, with bagging and magging, two well-established aggregation methods in the literature, providing valuable insights for regression analysis with large-scale data. While the results reveal similar performance across methods in terms of prediction accuracy, the approach based on normalized entropy largely outperforms the others in terms of precision accuracy, even with a smaller number of groups and fewer observations per group, which is an important advantage in inference problems with large-scale data. This work also warns of the risk of using the OLS estimator, particularly under collinearity, given that data scientists frequently use linear models as a simplified view of reality in big data analysis, and the OLS estimator is routinely applied in practice. Beyond the promising findings of the simulation study, our estimation and aggregation strategies show strong potential for real-world applications in fields such as econometrics, genomics, environmental sciences, and machine learning, where challenges such as noise and ill-conditioning are persistent.
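The risk described above can be made concrete with a minimal NumPy sketch. It does not implement the paper's normalized-entropy estimator (which is not specified here); it only illustrates, on synthetic data, how OLS coefficients become unstable under strong collinearity while a simple bagging-style aggregation (averaging OLS fits over bootstrap resamples) still yields accurate predictions. All variable names and the data-generating process are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ill-conditioned design: x2 is nearly a copy of x1 (strong collinearity).
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)

# Plain OLS on the full sample (least-squares solution).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Bagging-style aggregation: fit OLS on bootstrap resamples, average coefficients.
B = 50
betas = np.empty((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
    betas[b], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
beta_bag = betas.mean(axis=0)

# Predictions stay accurate, but the individual coefficients of the
# collinear pair (x1, x2) vary wildly across resamples.
print("OLS coefficients:   ", np.round(beta_ols, 2))
print("Bagged coefficients:", np.round(beta_bag, 2))
print("Coefficient std across resamples:", np.round(betas.std(axis=0), 2))
```

This mirrors the abstract's finding in miniature: prediction error is similar for both estimators, but inference on individual coefficients is unreliable under collinearity, which is where the spread across resamples becomes visible.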