Spark Cracks Big Data Code

New toolkit tames massive datasets for better predictions.

Researchers have built a new software toolkit on Spark, a widely used engine for spreading data processing across many machines, to run statistical analyses on datasets too large for traditional methods.

The research tackled a common hurdle: "big data," meaning collections of information far too large for a single ordinary computer to handle. Traditional statistical tools just aren't built for that scale.

Researchers aimed to build a powerful new tool for spotting patterns in big data, specifically focusing on logistic regression [a statistical method used to predict an outcome with two possibilities, like "yes" or "no," based on other factors]. They set out to check how well Spark's existing functions (pre-built parts of a computer program) performed, and then to create a new package that produces detailed statistical summaries.
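To make the bracketed definition concrete, here is a minimal Python sketch of how a logistic regression turns input factors into a probability of "yes." The coefficient values and function names are made up for illustration and do not come from the study.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to stay stable for very large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(intercept: float, coefs: list, x: list) -> float:
    """Probability of the 'yes' outcome for one observation."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return sigmoid(z)

# Hypothetical model: intercept -1.0, two factors with weights 0.5 and 2.0.
# For x = [2.0, 0.0] the linear score is -1.0 + 0.5 * 2.0 = 0, so p = 0.5.
p = predict_proba(-1.0, [0.5, 2.0], [2.0, 0.0])
```

A score of zero means the model is undecided; positive scores push the probability toward "yes" and negative scores toward "no."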

Putting the Toolkit to the Test

The new toolkit was tested on a dataset with a whopping 100 million observations and 140 variables. The team also used simulated data, generated so that the correct answers were known in advance (like a practice quiz with an answer key), to double-check their methods.
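The idea of checking a method against data with known answers can be sketched in a few lines of Python. This is an illustration only, not the authors' setup: the "true" coefficients, the sample size, the single predictor, and the simple Newton-Raphson fitting routine are all assumptions chosen to keep the example small.

```python
import math
import random

def simulate(n, intercept, slope, seed=0):
    """Draw (x, y) pairs from a logistic model whose coefficients we choose."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        rows.append((x, 1 if rng.random() < p else 0))
    return rows

def fit_logistic(rows, iterations=25, tol=1e-10):
    """Newton-Raphson fit of a one-predictor logistic regression."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # entries of X^T W X
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # solve the 2x2 system by hand
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Assumed "answer key" for the practice quiz.
TRUE_INTERCEPT, TRUE_SLOPE = -0.5, 1.5
rows = simulate(5000, TRUE_INTERCEPT, TRUE_SLOPE, seed=42)
est_intercept, est_slope = fit_logistic(rows)
```

If the fitting routine is sound, the estimates land close to the values used to generate the data; a large gap would point to a problem in the method rather than in the data.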

The team compared two Spark functions:

GeneralizedLinearRegression (GLR)
LogisticRegression (LR)

These were benchmarked against glm, the standard model-fitting function in the R language. The team also built their own statistical summary tool for logistic regression using PySpark, Spark's Python interface.
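The article does not list what the summary tool reports. Assuming it resembles the table R's glm prints (estimate, standard error, z value, p-value), the calculation can be sketched in plain Python as follows. The function name, the one-predictor restriction, and the toy counts are hypothetical, and this is not the authors' PySpark code.

```python
import math

def wald_summary(xs, b0, b1):
    """Estimate, standard error, z value and two-sided p-value for the
    intercept and slope of a one-predictor logistic regression, evaluated
    at the fitted coefficients (b0, b1)."""
    i00 = i01 = i11 = 0.0
    for x in xs:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = p * (1.0 - p)
        i00 += w                  # Fisher information X^T W X
        i01 += w * x
        i11 += w * x * x
    det = i00 * i11 - i01 * i01
    variances = (i11 / det, i00 / det)   # diagonal of the inverse
    table = []
    for estimate, var in zip((b0, b1), variances):
        se = math.sqrt(var)
        z = estimate / se
        p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
        table.append((estimate, se, z, p_value))
    return table

# Toy data: 100 cases with x = 0 (30 "yes") and 100 with x = 1 (60 "yes").
# With one binary predictor the best-fit coefficients are log-odds of counts.
xs = [0.0] * 100 + [1.0] * 100
b0 = math.log(30 / 70)
b1 = math.log(60 / 40) - b0
summary = wald_summary(xs, b0, b1)
```

For this toy table the slope's standard error matches the textbook formula for a log odds ratio, the square root of 1/30 + 1/70 + 1/60 + 1/40, which makes it a convenient hand check.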

Key Findings

The results showed that the LR function was a true workhorse. It performed much better than GLR, especially when data had:

High correlation: Where different variables rise and fall together (like temperature and ice cream sales).
Quasi-multicollinearity: Where some variables are almost exact combinations of others, making it hard to pinpoint their individual effects.
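One common way to put a number on this problem is the variance inflation factor (VIF), a standard diagnostic borrowed from linear regression. The article does not say the authors used it, so the Python sketch below is an illustration with made-up values, limited to the two-predictor case where VIF = 1 / (1 - r^2).

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(a, b):
    """VIF for a model with exactly two predictors.
    Undefined (division by zero) if the predictors are perfectly collinear."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)

# Two unrelated predictors: correlation 0, so no inflation.
unrelated_a = [-1.0, 1.0, -1.0, 1.0]
unrelated_b = [-1.0, -1.0, 1.0, 1.0]

# A predictor and a near-copy of it: almost, but not exactly, collinear.
base = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
wiggle = [0.01, -0.01, 0.01, -0.01, 0.01, -0.01]
near_copy = [x + d for x, d in zip(base, wiggle)]

vif_unrelated = vif_two_predictors(unrelated_a, unrelated_b)
vif_near_copy = vif_two_predictors(base, near_copy)
```

A VIF near 1 means a variable's effect can be estimated cleanly; values in the thousands mean the estimate is dominated by noise, which is the situation described as quasi-multicollinearity.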

Every coefficient (a number that shows the strength and direction of a relationship) from the LR method fell within the expected range, matching the results from the trusted glm function. In contrast, GLR struggled, showing errors as large as 2.91 quadrillion. LR's errors were tiny by comparison, with a standard deviation of only 0.380.

"The LR function is found to be more robust than GLR, especially under high correlation or quasi-multicollinearity."
— Authors

This indicates that LR is more dependable, even when the data is tricky. The new package also showed impressive throughput (how much data it can process in a given time), which increased almost exponentially as the dataset grew.

Future Outlook

While the results are promising, the team noted their tool could be even faster if built in Scala, the language Spark itself is written in and so more "native" to it. They also faced some challenges with Spark's "expressiveness" (how easily one can write complex commands) and overall performance.

This new package is a first step toward making large-scale statistics possible for economic data. Moving forward, researchers might explore other powerful computing frameworks like TensorFlow and Dask for even bigger challenges.

This breakthrough brings us closer to unlocking the hidden insights within the vast ocean of big data.

Reference:
Ouattara, A., Bulté, M., Lin, W. J., Scholl, P., Veit, B., Ziakas, C., ... & Dikos, G. (2021). Scalable Econometrics on Big Data – The Logistic Regression on Spark. arXiv preprint arXiv:2106.10341.