Synthetic data are not the solution.
"Once the training began, researchers discovered a problem in the data: It wasn’t as diversified as they had thought, potentially limiting how much Orion would learn."
This is beyond obvious from a statistical perspective.
@cigitalgem I'm thinking about how naive (and/or desperate) these statements make them look. Aren't these guys supposed to be "AI industry leaders"? One would think they'd be able to realize the limited applicability of synthetic data to the real world, not to mention the issue of recursive pollution to top it off.
@elias_sorensen I had some interesting talks with the synthetic data guys in the fall. They were delusional and did not listen to reason.
@cigitalgem some of these guys are just chasing their own tails in their hype echo chamber at this point. While the transformer architecture stuff is cool, I wonder how overemphasis on LLMs will delay other necessary advancement in the field of data science. Also people getting excited by the benchmarks is driving me nuts. How do we know this isn't just overoptimization for these benchmarks...
@elias_sorensen @cigitalgem I prefer the term "high on their own supply".
@cigitalgem @dalias there was also the one about medicine recently. Here, let's go ahead and reduce the entire field of medicine to a quiz of 143 diagnoses.
@elias_sorensen @dalias yep. Ridiculous claims. Often in cases like these, the SOTA benchmark is in the damn training set!
@elias_sorensen yes. Start with @melaniemitchell on this front and chase the papers: https://arxiv.org/pdf/2210.13966.pdf
(I think that's the right one.)
@cigitalgem @elias_sorensen They fired and launched smear campaigns against anyone who wasn't delusional.