The LLM Compression Gap

Exploring the gap between LLM compression performance and competition constraints
Published August 20, 2025 | Research Analysis

The relationship between compression and intelligence has fascinated researchers for decades. The Hutter Prize embodies this connection, offering €500,000 for better Wikipedia compression. But there's a fundamental gap between what works in theory and what the prize constraints allow.


I first encountered the Hutter Prize while researching compression algorithms. €500,000 for compressing Wikipedia better than anyone else had managed. The premise was elegant: compression equals prediction equals intelligence. If you can compress data better, you understand it better.

If modern language models are getting eerily good at predicting what comes next in text, and compression is fundamentally about prediction, shouldn't LLMs be absolutely crushing traditional compression algorithms? During a long moped ride through northern Vietnam, I found myself thinking about this connection more seriously.

A Simple Idea

The idea seemed straightforward: instead of storing actual words, what if you stored their likelihood according to an LLM? If the model predicts "dangerous" is the 3rd most likely next word, just store "3" instead of "dangerous."

The Core Idea

High-probability words get low ranks (0, 1, 2) which compress extremely well. Unexpected words get high ranks, but that's okay because they're rare. The better your language model understands the text, the more often it predicts correctly, yielding better compression.
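To make that concrete, here is a minimal sketch of rank-based encoding and decoding at the token level (the real systems work on tokens rather than whole words). It assumes the Hugging Face transformers library, with GPT-2 standing in purely for illustration; LLMZip itself used LLaMA-7B and then fed the rank stream to an arithmetic coder, which this toy version skips.

```python
# Toy rank-based encoder/decoder. GPT-2 is a stand-in model for illustration;
# a real compressor would entropy-code the resulting rank stream.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def encode_ranks(text: str) -> tuple[int, list[int]]:
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    ranks = []
    with torch.no_grad():
        for i in range(1, len(ids)):
            logits = model(ids[:i].unsqueeze(0)).logits[0, -1]  # next-token scores
            order = torch.argsort(logits, descending=True)      # most likely first
            ranks.append((order == ids[i]).nonzero().item())    # where the real token fell
    return ids[0].item(), ranks                                  # first token + rank stream

def decode_ranks(first_id: int, ranks: list[int]) -> str:
    ids = torch.tensor([first_id])
    with torch.no_grad():
        for rank in ranks:
            logits = model(ids.unsqueeze(0)).logits[0, -1]
            order = torch.argsort(logits, descending=True)
            ids = torch.cat([ids, order[rank].unsqueeze(0)])     # invert the ranking
    return tokenizer.decode(ids)
```

Run on predictable prose, most of the rank stream is zeros and ones, which is exactly what an entropy coder loves.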

As with most good daydream ideas, if it seems obvious, someone has probably already tried it. I was glad to discover that there was indeed research in this direction.

LLMZip: The Idea in Practice

A quick search revealed LLMZip and FineZip - papers that had implemented exactly this approach. The results were impressive.

Compression Performance Comparison

Method                  | Bits/Character | Compression Time  | Dataset      | Year
LLMZip (LLaMA-7B + AC)  | 0.636          | ~227 hours (10MB) | enwik8       | 2023
FineZip                 | ~0.64          | ~4 hours (10MB)   | enwik8       | 2024
ts_zip (Bellard)        | 1.084          | ~minutes          | enwik9       | 2023
fx2-cmix (Hutter Prize) | 0.944          | ~50 hours         | enwik9 (1GB) | 2024
Traditional (zlib)      | ~2.8           | seconds           | various      | -

The LLM methods weren't just better - they were dramatically better. LLMZip achieved 0.636 bits per character, compared with the current Hutter Prize winner at 0.944. That approaches the low end of Shannon's estimated entropy range for English text (~0.6-1.3 bpc).
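Those bits-per-character figures fall straight out of the model's predictions: an ideal arithmetic coder spends about -log2(p) bits on a symbol the model assigned probability p, so the average coding cost per character is just the model's cross-entropy. A toy calculation with made-up probabilities, only to show the mechanics:

```python
import math

def bits_per_char(token_probs: list[float], num_chars: int) -> float:
    """Average coding cost: an ideal arithmetic coder pays -log2(p) bits per token."""
    total_bits = sum(-math.log2(p) for p in token_probs)
    return total_bits / num_chars

# A 20-character phrase whose 4 tokens the model predicted with these probabilities:
print(bits_per_char([0.6, 0.9, 0.05, 0.8], num_chars=20))  # ≈ 0.28 bpc
```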

The Constraint Problem

Then I read the Hutter Prize rules more carefully. The constraints were extremely restrictive.

Hutter Prize Constraints vs LLM Requirements

Resource     | Hutter Prize Limit | FineZip Needs        | Gap
Time         | 70,000/T hours*    | ~4 hours (10MB)      | Still ~8x too slow
Memory       | 10GB RAM           | ~13GB+ (LLaMA-7B)    | Won't fit
GPU Usage    | Not allowed        | Essential for speed  | Impossible
Dataset Size | 1GB (enwik9)       | Works, but limited   | Scale mismatch

*T = Geekbench5 score. Test machines: Intel i7-1165G7 (T≈1427) = ~49 hours, AMD Ryzen 7 3.6GHz (T≈1310) = ~53 hours

The math was still brutal. Even with FineZip's 54x speedup over LLMZip, it needed 4 hours just for 10MB. For the full 1GB enwik9 dataset, you'd need roughly 400 hours - about 8 times the Hutter Prize limit, a significant improvement but still prohibitive.
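The back-of-the-envelope version, using the numbers from the tables above (the Geekbench score is the i7-1165G7 example from the footnote):

```python
geekbench_T = 1427                                    # example i7-1165G7 score
limit_hours = 70_000 / geekbench_T                    # ≈ 49 hours allowed by the rules
finezip_hours_per_10mb = 4
enwik9_hours = finezip_hours_per_10mb * (1_000 / 10)  # 1 GB in 10 MB chunks ≈ 400 hours
print(enwik9_hours / limit_hours)                     # ≈ 8x over the limit
```

And that still assumes a GPU, which the rules don't allow.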

Community Perspective

Digging deeper, I found a prescient Reddit discussion from 2020 that perfectly captured the problem:

The prize has been largely useless because the resource constraints are many orders of magnitude too severe, and the dataset too tiny.

The overall thesis that prediction=intelligence has been very strongly vindicated by, most notably recently in scaled-up language models trained solely with a self-supervised prediction loss who have near-perfect correlation of their perplexity/BPC compression performance with human-like text generation and benchmarks... but not a single SOTA of interest can be trained or run within the original Hutter Prize constraints.

Genuine intelligence requires far more resources and data to amortize itself over than the HP provides. The only things that run within those constraints are things like PAQ8, which are too slow to be of any ordinary software engineering interest... and yet, too cheap to be anything but completely useless to AI/ML research.

This crystallized the fundamental tension: the Hutter Prize validates compression=intelligence in theory, but constrains it so severely that actual intelligent systems can't participate.

Potential Path Forward

There might be one viable path forward. The current Hutter Prize winner, fx2-cmix, already includes sophisticated preprocessing with natural-language features such as stemming, along with neural components including LSTM-based context mixing. However, the extent to which modern LLM embeddings are currently used requires further investigation.
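For context on what "context mixing" means here: the cmix/PAQ family combines many simple bit predictors with a small, online-trained logistic mixer. A stripped-down sketch of that mixing step is below (illustrative only; fx2-cmix's real mixer is far more elaborate and includes the LSTM mentioned above):

```python
import math

def stretch(p: float) -> float:   # logit; expects p strictly between 0 and 1
    return math.log(p / (1 - p))

def squash(x: float) -> float:    # inverse logit
    return 1 / (1 + math.exp(-x))

class LogisticMixer:
    """Combine several models' P(next bit = 1) estimates in the logit domain."""
    def __init__(self, n_models: int, lr: float = 0.02):
        self.w = [0.0] * n_models
        self.lr = lr

    def predict(self, probs: list[float]) -> float:
        self.x = [stretch(p) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.w, self.x)))
        return self.p

    def update(self, bit: int) -> None:
        err = bit - self.p                      # gradient of the coding loss
        self.w = [w + self.lr * err * x for w, x in zip(self.w, self.x)]
```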

The Preprocessing Opportunity

What if tiny, specialized language models (10-100M parameters) could enhance the semantic understanding in the preprocessing stage? Research on neural scaling laws and knowledge distillation suggests such models can retain much of a larger teacher's performance (figures around ~95% are reported) while fitting easily within the memory constraints.
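If that route were worth pursuing, the shrinking step would presumably look like standard knowledge distillation: train the tiny model to match a large teacher's soft next-token distribution as well as the true tokens. A hedged sketch of the usual loss (the temperature and mixing weight are arbitrary placeholder values, not anything tuned for compression):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend of soft-target matching (teacher) and ordinary next-token loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                       # rescale for the temperature
    hard = F.cross_entropy(student_logits, targets)   # still fit the real tokens
    return alpha * soft + (1 - alpha) * hard
```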

Current compression algorithms like fx2-cmix already incorporate neural networks and semantic processing within the strict computational limits, but there may still be opportunities to improve upon their existing approaches with more targeted models.

Scaling Laws and Future Directions

Understanding how model performance scales with size is crucial here. Research on neural scaling laws suggests that even tiny models can capture substantial semantic understanding when trained properly.

The key insight: you don't need GPT-4 scale to get useful semantic compression gains. A 50M parameter model specialized for Wikipedia compression might provide enough semantic boost to edge out the current winner, while still fitting within the absurd constraints.
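The memory arithmetic at least works out. A rough budget for a hypothetical 50M-parameter preprocessor:

```python
params = 50_000_000
weights_fp32_gb = params * 4 / 1e9       # ≈ 0.2 GB as 32-bit floats
weights_int8_gb = params * 1 / 1e9       # ≈ 0.05 GB if quantized to 8 bits
print(weights_fp32_gb, weights_int8_gb)  # either fits comfortably inside the 10 GB RAM limit
```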

Next Steps: Getting Closer to the Machine

I'm planning to dive deeper into how current compression algorithms actually work. After recently working with cloud-based AI systems churning out endless content, there's something appealing about working within real constraints for once. The Hutter Prize forces you to think carefully about every bit, every cycle, every byte of memory.

My plan is to first understand fx2-cmix's implementation - how exactly does its neural network preprocessing work? What semantic features is it already capturing? Then I'll explore whether there are opportunities to integrate small, specialized models within those brutal constraints.

A Different Kind of Challenge

Instead of throwing computational resources at problems, this requires genuine optimization and understanding. It's a refreshing change from the "just add more GPUs" approach that dominates modern AI.

The Hutter Prize constraints may be anachronistic, but they create an interesting sandbox for exploring the boundaries between statistical optimization and semantic understanding. Plus, there's something satisfying about trying to squeeze every bit of performance from a system that has to fit in 10GB of RAM.

I have a lot of respect for Marcus Hutter's rationale behind keeping these constraints so restrictive. As he explains on the website:

What if I can (significantly) beat the current record?

In this case, submit your code and win the award and/or copyright your code and/or patent your ideas. Also note that a provisional patent application is quite cheap. Thereafter you can enter the competition and fund the expensive non-provisional patent application with the prize money. You should be able to monetize your invention beyond the HKCP. This happened to the first winner, a Russian/Ukrainian who always had to cycle 8km to a friend to test his code because he did not even have a suitable computer, and who now has a lucrative job at QTR in Canada... The mp3 patent (the most famous lossy compressor for music) for instance, made millions of dollars from licensing fees. If your compressor is revolutionary, say beats current SOTA by over 30%, this is most likely due to a mistake or misunderstanding or violation of the rules.

It's a strange approach to encouraging AI development, but there's wisdom in it. The constraints force genuine innovation rather than just throwing more compute at the problem.

In my degree right now, all the AI we're covering is old school - we haven't done any work with neural networks or transformers. It's all pre-LLM stuff. The way I'm wired, I find it hard not to take the shortest path to the goal, so I find a certain romanticism in wondering what things would be like if the constraints were harder.

I'll be wielding my army of AI agents to help build frameworks for understanding this problem, and to create the environments and GUIs that assist me, as I do with everything now. But hopefully that will hit a dead end pretty quickly, and I'll be forced to learn something before I become totally dependent on AI.