An 8TB corpus of permissively licensed text for training AI models. Comes with two 7B models trained as proof-of-concept.
https://github.com/r-three/common-pile/blob/main/paper.pdf
I'm glad somebody has finally done this. The "we need to break copyright or AI won't work" argument feels super-dodgy to me, and now we can simply evaluate whether it's true. 1/n
I talked about this earlier here.
https://sigmoid.social/@pbloem/111096112737456477
It's not a full saving grace for copyright, because most of this data is ShareAlike. That means that if the model spits out something that is too close to the training data for copyright purposes, you still need to attribute.
That might be hard to detect at scale, but it might be solvable.
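For what it's worth, one way it might be solvable: index hashed n-grams of the corpus offline and check model outputs against that index before release. A minimal sketch (the 8-word threshold, the hashing scheme, and the corpus plumbing are all my assumptions, nothing from the paper):

```python
from hashlib import blake2b

# Hypothetical sketch: flag outputs that share a long verbatim n-gram
# with the training corpus, so attribution can be triggered.
N = 8  # flag any 8-word span copied verbatim (threshold is a guess)

def ngram_hashes(text: str, n: int = N) -> set:
    words = text.lower().split()
    return {
        blake2b(" ".join(words[i:i + n]).encode(), digest_size=8).digest()
        for i in range(max(0, len(words) - n + 1))
    }

# Built offline over the whole corpus. 8-byte digests keep memory roughly
# linear in corpus size; at 8TB you'd shard this or swap in a Bloom filter.
corpus_index = ngram_hashes("text of one licensed training document ...")

def needs_attribution(output: str) -> bool:
    return not corpus_index.isdisjoint(ngram_hashes(output))
```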
They train a 7B model on a 600GB subsample of the training data, which looks pretty competitive with models of a similar size.
This is a data volume similar to GPT-3's, but with a smaller model (then again, GPT-3 probably didn't need to be as big as it was).
Scaling this up, it seems we could go to at least 100B models with the data they collected.
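Back-of-envelope for that claim, assuming Chinchilla's ~20 tokens per parameter and ~4 bytes of raw text per token (both rough guesses on my part):

```python
# Rough Chinchilla-style estimate: how big a model does 8TB support?
corpus_bytes = 8e12              # the 8TB they collected
tokens = corpus_bytes / 4        # ~2T tokens at ~4 bytes/token
chinchilla_params = tokens / 20  # ~20 tokens per parameter
print(f"~{tokens / 1e12:.0f}T tokens -> ~{chinchilla_params / 1e9:.0f}B params")
```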
So, where are we on the question I asked in Sept 2023? Is it feasible to beat GPT-4's performance using only open data?
The MMLU performance is just over 40%. GPT-4 reported 84% on release. However, GPT-4 is estimated to be about 200B params. What's more, in the second model they train, doubling the number of tokens from 1T to 2T, the performance jumps to ~50%.
In short, we can't call it yet, but we can't rule it out either. My guess is it should at least be possible to get close to 84% MMLU.
You might say: so what? This approach is always going to lose out to those who use closed data _as well_.
But that ignores a crucial thing that we didn't have two years ago: self-improvement operators.
We now know that we can force models to "reason" and to reflect in order to improve their answers. Yes, you can train on terabytes more closed-source data, but you can also spend that training budget sampling chains of thought and training them back into the model.
From the toy experiments I've seen, DeepSeek-style reasoning traces work on 8B models. So this amount of data is at least enough to get us to the point where things like that start to work (as well as instruction-tuning generalization).
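For concreteness, a toy sketch of the kind of loop I mean: STaR-style rejection sampling, where the model, the task, and the verifier are all stand-ins I made up, not anything from the paper or from DeepSeek's actual recipe:

```python
import random

# Toy self-improvement loop: sample many reasoning traces, keep only
# those whose final answer a verifier accepts, and train the surviving
# traces back into the model. A real run would fine-tune an actual LLM;
# here the "model" is just a bag of candidate offsets on a toy task.

def sample_trace(model, problem):
    """Stand-in for sampling a chain of thought; returns (trace, answer)."""
    offset = random.choice(model["offsets"])
    return f"reasoning about {problem}...", problem + offset

def self_improve(model, problems, rounds=3, k=8):
    for _ in range(rounds):
        kept = []
        for p in problems:
            for _ in range(k):
                trace, answer = sample_trace(model, p)
                if answer == 2 * p:            # verifier: ground-truth check
                    kept.append((p, trace, answer))
        if kept:  # "training back in": bias the model toward what passed
            model["offsets"] = [a - p for p, _, a in kept]
    return model

model = {"offsets": [-1, 0, 1, 2, 3]}  # an untrained, mostly-wrong model
print(self_improve(model, problems=[1, 2, 3])["offsets"])
```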
So, on balance, I think the copyright argument will probably disappear soon. We will focus on bootstrapping from data like this (and equivalent efforts in other domains), and then use self-reflection to train from there.
Everything else that requires copyrighted content will be done in-context with RAG and other methods.
Some technical details. They unsurprisingly lean on Wikimedia sources.
They strip the wiki markup, which seems a waste. There is a lot of information in there. I would at least have converted it to Markdown to preserve the basic structural annotation.
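Even a crude conversion would keep most of the structure. A toy sketch (real wikitext needs a proper parser like mwparserfromhell for templates, tables, and nesting; these regexes only cover the basics):

```python
import re

# Toy wikitext -> Markdown conversion that preserves basic structure
# instead of stripping it. Rule order matters: ''' before '', === before ==.
RULES = [
    (re.compile(r"^===\s*(.+?)\s*===$", re.M), r"### \1"),      # sub-subheading
    (re.compile(r"^==\s*(.+?)\s*==$", re.M),   r"## \1"),       # subheading
    (re.compile(r"'''(.+?)'''"),               r"**\1**"),      # bold
    (re.compile(r"''(.+?)''"),                 r"*\1*"),        # italics
    (re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]"), r"\1"),     # [[link|text]]
    (re.compile(r"^\*\s*", re.M),              "- "),           # bullet lists
]

def wikitext_to_markdown(text: str) -> str:
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

print(wikitext_to_markdown("== History ==\n'''Bold''' and [[Foo|a link]]."))
```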
I'm curious whether they used talk pages too. There may be more data out there than they used.
They were pretty diligent in cleaning the data, I'd say. They didn't just dump Wikipedia in wholesale; they made an effort to remove copyrighted material embedded in it, like song lyrics.
https://github.com/r-three/common-pile/tree/main/sources/wiki
From this discussion, it seems like talk pages are included.
@pbloem It’s not ShareAlike licence that’s the problem, it’s the Attribution criterion, which applies to all the CC BY as well as CC BY-SA content (including all of Wikipedia). Correctly attributing any chunks of text that come or are closely-derived from Wikipedia will be difficult, as will forcing users to use an open licence for any derivative work they make with the LLM (if it’s even considered copyrighted to them).
@Adzebill You're right, it's the BY, not the SA, that I was talking about.
Although SA does also mean that if you build a bot, anything it spits out should at the very least be CC licensed.
@pbloem Not just at the very least, it’s legally required to use that same CC BY-SA (and credit the copyright owner). This corpus seems like an unworkable mess.
@Adzebill You're right, this is messier than I thought. I think in practice it boils down to the following questions.
1) When is an AI model itself a derivative work?
2) When is the output generated by a model a derivative work?
(2) is only a problem for production systems, but (1) is relevant for distributing any trained model, which is more or less required for open science.
@Adzebill In both cases, it's clearly a spectrum.
I can do a word-frequency count on Wikipedia data, and nobody would claim I need to license the counts under SA to distribute them. But I could also train a model to memorize the data exactly, which would amount to distributing it verbatim.
@pbloem I think this is getting needlessly esoteric. Why not start with “what is a work?”, then “What is a copyrightable work”, and then “What is a derivative of a copyrightable work?” All these are pretty well defined legally. I think whether the tool commits plagiarism is a far more fruitful question than whether the tool itself is plagiarism. https://en.wikipedia.org/wiki/Derivative_work
@Adzebill There are two reasons to focus on the models. First, that is what they're distributing now (I think under an incompatible license).
Second, this is the minimal use case that makes the dataset useful. As a scientist, I don't need to worry about a model spitting out copyrighted stuff if I never put the model into production. I do, however, need to distribute the model itself, to make my work reproducible.
Right now, the only way we can do science on LLMs is to use models that have been trained on unlicensed data. This sucks, since most scientists don't want to disrespect copyright, but we also need to study this new technology and the claims that people are making about it.
That makes this dataset, and their models, a godsend, even if it still doesn't guarantee that you can easily build a working production bot on top of it.
@pbloem That makes it clearer, I see where you’re coming from.
@pbloem It's very strange to consider IRC logs "public domain". First, I didn't see any mention of a license on the website they cite, and second, it's the people talking who must decide whether what they produce is in the public domain, not the IRC operator (or a paper on GitHub).
@sebbernery It looks like these are (more or less) the terms of use of the IRC channel.
I don't know if people are warned about this when signing on to the channel, but the warning on the website seems to be directed at users before they connect.
@sebbernery Biderman mentioned something on the Discord about needing more lawyers if they go multilingual in the future, so I do think they're checking the legality carefully.