Big Data Breakthrough for Econometrics

Economists can now analyze very large datasets more reliably, thanks to a scalable tool for logistic regression built on Apache Spark. The research addresses the difficulty of applying traditional statistical methods to "big data", meaning datasets too large and complex for conventional tools to handle comfortably.

The Challenge with Big Data

Conventional statistical tools such as Stata, R, and Python tend to struggle once a dataset grows to many millions of observations. The research team focused on logistic regression, a fundamental statistical technique that models the probability of an event occurring as a function of explanatory variables.
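To make the idea of "modeling the probability of an event" concrete, here is a minimal sketch in plain Python. It is an illustration only and not the authors' code; the function names (sigmoid, predict_proba, log_likelihood) are hypothetical and chosen for this example.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(coefs, intercept, x):
    """P(y = 1 | x) = sigmoid(intercept + coefs . x)."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return sigmoid(z)

def log_likelihood(coefs, intercept, rows, labels):
    """Sum of y*log(p) + (1 - y)*log(1 - p) over all observations.

    Fitting a logistic regression means choosing the coefficients that
    maximize this quantity. No guard is included for p exactly 0 or 1.
    """
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict_proba(coefs, intercept, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

With all coefficients set to zero, every observation gets probability 0.5, which is the natural starting point for an iterative fit.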

Overcoming Limitations: The Research Approach

The researchers approached the problem in three steps:

  1. Testing Spark's Built-in Functions: They began by evaluating Apache Spark's existing logistic regression functions.
  2. Developing a Custom Package: They introduced their own package, written in PySpark, that returns the statistical summary of the logistic regression, which is needed for statistical inference.
  3. Optimizing Calculations: They tuned how key calculations are carried out within Spark to make them more efficient.
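One way to see why a statistical summary can be computed at scale is that the information matrix of a logistic regression is a sum over observations, so each chunk of data can contribute a partial sum independently. The sketch below illustrates that textbook idea in plain Python. It is not the authors' package, and all names (partial_information, summary, and so on) are hypothetical; it assumes each row already includes a leading 1 for the intercept and that the coefficients beta have been fitted elsewhere.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def partial_information(rows, beta):
    """Sum of p*(1-p) * x x^T over one chunk of rows."""
    k = len(beta)
    info = [[0.0] * k for _ in range(k)]
    for x in rows:
        p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        w = p * (1.0 - p)
        for i in range(k):
            for j in range(k):
                info[i][j] += w * x[i] * x[j]
    return info

def add_matrices(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def invert(m):
    """Gauss-Jordan inversion with partial pivoting, for a small k x k matrix."""
    k = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(m)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("information matrix is singular or nearly so")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def summary(chunks, beta):
    """Return (standard errors, z statistics) for fitted coefficients beta."""
    k = len(beta)
    info = [[0.0] * k for _ in range(k)]
    for chunk in chunks:  # each chunk stands in for one data partition
        info = add_matrices(info, partial_information(chunk, beta))
    cov = invert(info)  # estimated covariance of the coefficients
    se = [math.sqrt(cov[i][i]) for i in range(k)]
    z = [b / s for b, s in zip(beta, se)]
    return se, z
```

Only the small k x k matrix has to be collected and inverted in one place; the pass over the rows can be split across as many chunks as needed, and the result does not depend on how the rows are divided.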

Real-World Application and Findings

The researchers validated their tool on a real-world dataset of 100 million observations and 140 variables, and also generated artificial datasets to confirm their findings. They compared three methods: GLM from R and two Spark functions, GLR and LR.

They discovered that Spark's LR function performed significantly better than GLR, especially in cases of "quasi-multicollinearity" (when variables in a statistical model are so highly correlated that it destabilizes the model and makes it difficult to interpret individual variable effects).
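As a rough illustration of what quasi-multicollinearity looks like in data, the sketch below computes the correlation between two columns and the corresponding variance inflation factor, 1 / (1 - r^2), a standard diagnostic for a pair of predictors. This is a generic example in plain Python with made-up numbers; it is not taken from the paper, and the function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally long columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def vif_two_columns(a, b):
    """Variance inflation factor 1 / (1 - r^2) for a pair of predictors."""
    r = pearson(a, b)
    return 1.0 / (1.0 - r * r)

# Made-up columns: x2 is almost exactly 2 * x1, x3 is uncorrelated with x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.01, 3.99, 6.02, 7.98, 10.01, 11.99]
x3 = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]

print("VIF for x1 vs x2:", vif_two_columns(x1, x2))  # very large
print("VIF for x1 vs x3:", vif_two_columns(x1, x3))  # equal to 1
```

A factor near 1 means the two columns carry separate information. A very large factor means the coefficient estimates for those columns become unstable, which is the situation in which the comparison between LR and GLR matters most.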

Example of LR Function Accuracy:
The LR function delivered accurate results even on difficult data and produced unbiased estimates for datasets with independent columns and balanced outcomes. For instance, the errors of the LR method had a standard deviation of 0.380, compared with 2.91 × 10^15 for GLR.

Future Directions

While the current tool marks a significant advancement, the authors noted that building it directly in Scala (Spark's native language) could further enhance its performance. Future work could also explore the application of other statistical models on big data platforms.

This new scalable tool empowers researchers to unlock more insights from the ever-growing mountains of data, offering a clearer picture of complex economic trends.


Source:
Ouattara, A., Bulté, M., Lin, W. J., Scholl, P., Veit, B., Ziakas, C., ... & Dikos, G. (2021). Scalable Econometrics on Big Data – The Logistic Regression on Spark. arXiv preprint arXiv:2106.10341.