
Scientists Discover Better Way to Shrink Digital Trees

New measure packs XML data tighter than ever.

Scientists have found a new way to measure how far digital "tree" data can be compressed, which could lead to more compact storage.

This research tackled a big question: How can we squeeze more information into less space when dealing with tree-like digital structures, like the ones used for web pages or databases? These "trees" are a bit like family trees, showing how different pieces of information connect.


The Experiment: Measuring Data Shrinkage

The team looked at 13 real XML files. XML (Extensible Markup Language) is a way to encode documents in a format that is both human-readable and machine-readable.
They took the tree structures from these files and stripped away the text.
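
To make the stripping step concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It keeps only the tag names and their nesting. The function names and the sample document are hypothetical, and dropping attributes along with the text is an assumption of this sketch, not something the study is confirmed to have done.

```python
import xml.etree.ElementTree as ET


def strip_text(element):
    """Return a copy of the element keeping only tag names and nesting."""
    bare = ET.Element(element.tag)  # no text, no attributes (assumption)
    for child in element:
        bare.append(strip_text(child))
    return bare


def structure_only(xml_string):
    """Parse an XML string and serialize just its tree structure."""
    root = ET.fromstring(xml_string)
    return ET.tostring(strip_text(root), encoding="unicode")


# Hypothetical sample document, not one of the 13 files from the study.
sample = (
    "<library>"
    "<book id='1'><title>Dune</title></book>"
    "<book><title>Emma</title></book>"
    "</library>"
)
print(structure_only(sample))
```

What remains is a labeled tree: each node carries a tag name, and the order of children is preserved.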

Then, they calculated various "entropy" measures. Think of entropy here as a way to see how much "disorder" or randomness is in the data, which tells you how much it can be compressed. Less disorder means more compression.
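
As a rough illustration of the idea, the sketch below computes the simplest such measure, zeroth-order empirical entropy, for a plain list of symbols. This is the classic string version, shown only to build intuition; it is not the tree entropy from the study, and the example lists are made up.

```python
import math
from collections import Counter


def empirical_entropy(symbols):
    """Zeroth-order empirical entropy in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    # Each symbol occurring c times out of n contributes (c/n) * log2(n/c).
    return sum((c / n) * math.log2(n / c) for c in counts.values())


uniform = ["a", "b", "c", "d"] * 2   # every symbol equally common
skewed = ["a"] * 7 + ["b"]           # one symbol dominates

print(empirical_entropy(uniform))    # higher: harder to compress
print(empirical_entropy(skewed))     # lower: easier to compress
```

The skewed list scores far lower than the uniform one, so it needs fewer bits per symbol on average.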

The Winning Measure: Hk(t)

The main result was a clear winner: a measure called "kth-order label-shape entropy," or Hk(t).
For all the XML trees they looked at, and for all k values greater than 0, Hk(t) proved "significantly smaller" than other combinations of entropy measures. This means Hk(t) found more hidden patterns, allowing for better compression.
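
The role of k is easiest to see on ordinary strings. The sketch below computes the standard kth-order empirical entropy of a string, where each symbol is predicted from the k symbols before it. It is an analogy only: the study's Hk(t) is defined on trees, and its exact definition is not reproduced here. The function name and test string are hypothetical.

```python
import math
from collections import Counter, defaultdict


def kth_order_entropy(s, k):
    """kth-order empirical entropy of a string, in bits per symbol."""
    n = len(s)
    if n <= k:
        return 0.0
    # For every length-k context, count which symbols follow it.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[s[i - k:i]][s[i]] += 1
    total_bits = 0.0
    for counts in followers.values():
        m = sum(counts.values())
        total_bits += sum(c * math.log2(m / c) for c in counts.values())
    return total_bits / n


text = "abababab"
print(kth_order_entropy(text, 0))  # ignores context
print(kth_order_entropy(text, 1))  # uses one symbol of context
```

With k = 0 the string looks like a coin flip between "a" and "b". With k = 1 every symbol is fully determined by the one before it, so the measure drops to zero. Context exposes patterns that a context-free measure misses.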

Why Hk(t) is Superior

Co-author Markus Lohrey highlighted that Hk(t) is:

"the most effective measure in capturing regularities in both labeled and unlabeled trees."

It even helped tell apart unlabeled trees of the same size, which other measures couldn't do.


Implications for Our Digital World

This discovery matters because it points to how we can build more efficient data structures for the vast amount of tree-structured data we create every day.

Imagine:

  • Your computer or phone needing less space to store the same amount of information.
  • Websites loading faster because their underlying data is more compact.

Future Work

The study notes a limitation: while Hk(t) sets a strong benchmark for compression, there is not yet a data structure that stores a tree in space close to this measure while still supporting quick look-ups.

Future work will focus on:

  • Creating such data structures.
  • Exploring how Hk(t) performs with different types of tree data.

Ultimately, Hk(t) offers a powerful new lens for understanding and shrinking our digital world.

Reference:

Danny Hucke, Markus Lohrey, and Louisa Seelbach Benkner. "A Comparison of Empirical Tree Entropies." arXiv preprint arXiv:2006.01695 (2020).