
Peter Bloem

New pre-print!

**Universal pre-training by iterated random computation.**

⌨️🐒 A monkey behind a typewriter will produce the collected works of Shakespeare eventually.

💻🐒 But what if we put a monkey behind a computer?

⌨️🐒 needs to be lucky enough to type all characters of all of Shakespeare correctly. 💻🐒 only needs to be lucky enough to type a program for Shakespeare.

This suggests that passing random noise through random computation _enriches_ it. (1/n)

arxiv.org/abs/2506.20057

The idea is that we can pre-train an LLM on this data and get a training benefit before we see our real data.

This may seem like an impossible dream. How can we train before seeing data? Doesn't that violate No Free Lunch?

However, we show theoretically that this approach is an approximation to predicting with the universal distribution, i.e. Solomonoff induction, a well-established universal prediction algorithm.
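For context (these are the standard definitions, not taken from the paper): the universal distribution $m$ weights every string by the programs that print it on a universal prefix machine $U$, and Solomonoff induction predicts with its conditionals:

```latex
m(x) = \sum_{p \,:\, U(p) = x} 2^{-|p|}
\qquad\text{and}\qquad
m(x_{n+1} \mid x_{1:n}) = \frac{m(x_{1:n}\,x_{n+1})}{m(x_{1:n})}
```

Here $|p|$ is the length of program $p$ in bits, so short programs dominate: a string with a simple generating program gets high probability even if it looks complex on the surface.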

In practice, we generate random tokens and pass them (repeatedly) through randomly initialized LSTMs.
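As a rough illustration of this data-generating step, here is a minimal sketch (my own, not the paper's code); a plain tanh RNN stands in for the paper's randomly initialized LSTMs, and all names and sizes are made up:

```python
import numpy as np

def random_rnn_enrich(tokens, vocab=256, hidden=64, passes=2, seed=0):
    """Pass a random token stream through a randomly initialized
    (untrained) recurrent net and sample new tokens from its outputs,
    repeating the process `passes` times. Sketch only: a plain tanh
    RNN stands in for the paper's LSTMs."""
    rng = np.random.default_rng(seed)
    # random, never-trained parameters
    emb = rng.normal(0, 1, (vocab, hidden))
    W_x = rng.normal(0, 1 / np.sqrt(hidden), (hidden, hidden))
    W_h = rng.normal(0, 1 / np.sqrt(hidden), (hidden, hidden))
    W_o = rng.normal(0, 1 / np.sqrt(hidden), (hidden, vocab))

    seq = np.asarray(tokens)
    for _ in range(passes):          # iterate the random computation
        h = np.zeros(hidden)
        out = []
        for t in seq:
            h = np.tanh(emb[t] @ W_x + h @ W_h)
            logits = h @ W_o
            p = np.exp(logits - logits.max())
            p /= p.sum()
            out.append(rng.choice(vocab, p=p))
        seq = np.array(out)
    return seq

noise = np.random.default_rng(1).integers(0, 256, size=128)
enriched = random_rnn_enrich(noise)  # same length, but now structured
```

The point of the iteration is that each pass composes another layer of random computation, so the output distribution carries more structure than the uniform noise that went in.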

We evaluate zero-shot on various downstream tasks, including English Wikipedia, German text and code, after every 100,000 instances of randomly generated data.

As we train, the model shows better-than-chance performance (chance level is 8 bits per token) across the board. On the real-world data, the model substantially beats the performance of an (in-context) Markov model.
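The thread doesn't specify the exact Markov baseline, so purely as an illustration, here is a minimal in-context bigram predictor scored in bits per token (all names and the add-alpha smoothing are my assumptions):

```python
from collections import Counter, defaultdict
import math

def markov_bits(seq, vocab=256, alpha=1.0):
    """Average bits per token under an order-1 in-context Markov model:
    each token is predicted from add-alpha smoothed bigram counts
    accumulated over the sequence seen so far (no training data)."""
    counts = defaultdict(Counter)  # counts[prev][next]
    totals = Counter()             # totals[prev]
    bits = 0.0
    prev = None
    for t in seq:
        if prev is not None:
            p = (counts[prev][t] + alpha) / (totals[prev] + alpha * vocab)
            bits += -math.log2(p)
            counts[prev][t] += 1
            totals[prev] += 1
        prev = t
    return bits / max(len(seq) - 1, 1)

# On uniform noise this stays near the chance level of ~8 bits/token;
# on repetitive data it drops well below it.
```

Beating this baseline on real text means the UP model has picked up more than surface bigram statistics from the random data alone.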

After training, we finetune on real-world data. We observe that the models that have been pre-trained with noise converge very quickly compared to a baseline which is trained from scratch.

Moreover, on the other datasets, the UP (universal pre-training) models retain their zero-shot performance during finetuning. This suggests that there may be a generalization benefit to using a UP model.

All this is at the expense of much longer training, but that cost can be amortized over many tasks.

Long story short, this type of mechanism allows for a data/compute tradeoff. We can train models at the same quality with less data by investing more compute (and thus energy).

This is all very early days, and the tradeoff may never be worthwhile, but if it is, there are some serious questions we should ask ourselves.

Third option, left to future work.